Skip to content

[APMSVLS-501] refactor(bottlecap): preparatory work for testmode binary#1201

Open
lucaspimentel wants to merge 28 commits intomainfrom
lpimentel/bottlecap-test-mode
Open

[APMSVLS-501] refactor(bottlecap): preparatory work for testmode binary#1201
lucaspimentel wants to merge 28 commits intomainfrom
lpimentel/bottlecap-test-mode

Conversation

@lucaspimentel
Copy link
Copy Markdown
Member

@lucaspimentel lucaspimentel commented Apr 23, 2026

Overview

Preparatory refactors for the upcoming bottlecap-testmode binary (APMSVLS-501), which reuses TraceAgent / handle_traces but has no Lambda lifecycle to drive. Three logical changes:

1. InvocationProcessorHandle::noop()

Constructor backed by a background task that acknowledges every ProcessorCommand with a sensible default so callers never block on their response oneshots:

  • Request-response commands (GetReparentingInfo, UpdateReparenting, SetColdStartSpanTraceId, PlatformRuntimeDone, PlatformReport) reply with empty/default values.
  • Fire-and-forget commands are dropped silently.

The match is exhaustive: adding a new ProcessorCommand variant will cause a compile error here, forcing test-mode behavior to be decided explicitly. A response-carrying variant placed in the fire-and-forget arm would silently drop its sender, causing the caller to receive ProcessorError::ChannelReceive instead of the intended default.

The noop channel capacity matches the real InvocationProcessorService to avoid backpressure surprises.

2. Promote start_trace_agent to a top-level library module

Moved the start_trace_agent helper from the private bin/bottlecap/main.rs into a new public module, bottlecap::startup, so callers outside the Lambda binary can construct the full trace-processing pipeline through a single call instead of duplicating ~110 lines of wiring. Placed at the crate root (not under traces/) because the module wires trace, stats, proxy, config, lifecycle, tags, appsec, and flushing pieces together: it is cross-cutting orchestration, not a trace-domain API. No behavior change; the Lambda binary's call site is unchanged (same function name, same signature, same internal spawn).

The companion TraceAgentPipeline return type is a pub struct with named pub fields (trace/stats/proxy flushers, the trace-channel sender, the shutdown token, and the aggregator/concentrator handles), rather than a positional tuple.

3. RouterExtension seam on TraceAgent

Adds a generic extension point that lets a caller merge additional axum routes into the trace-agent's HTTP router without TraceAgent knowing what those routes do. When no caller attaches an extension (the Lambda binary case), the HTTP surface on port 8126 is unchanged.

  • New pub trait RouterExtension: Send + Sync { fn extend(&self, router: Router) -> Result<Router, Box<dyn Error>>; } defined in traces::trace_agent. Returning Err aborts agent startup and surfaces the error in the existing log path, preventing silent failures from a panicking or misconfigured extension.
  • TraceAgent holds router_extension: Option<Arc<dyn RouterExtension>> with a consuming builder method with_router_extension(self, ext: Arc<dyn RouterExtension>) -> Self.
  • make_router calls extension.extend(router)? when the field is set, before applying the outer fallback and body-limit layers.
  • Split the library helper in startup.rs into build_trace_agent (returns an unspawned TraceAgent plus the TraceAgentPipeline handles) and a thin start_trace_agent wrapper that spawns. Callers that need to attach a RouterExtension before startup use build_trace_agent; the Lambda binary's call site continues to use start_trace_agent.
  • A #[cfg(test)]-only SpyExtension validates the seam end-to-end: the trait shape compiles, state is carried via Arc, and merged routes are reachable through the composed Router returned by make_router.

Any concrete flush/drain/diagnostic endpoint lives behind an impl of this trait in the consumer crate, not in trace_agent.rs.

Testing

  • cargo check --bin bottlecap
  • cargo check --lib
  • cargo test --lib
  • cargo clippy --workspace --all-targets --features default -- -D warnings
  • cargo fmt --all -- --check
  • cargo nextest run --workspace (532/532 passed)
  • New unit tests on InvocationProcessorHandle::noop(): noop_request_response_methods_return_defaults, noop_fire_and_forget_commands_do_not_panic, noop_platform_runtime_done_and_report_respond_without_blocking, noop_request_response_variants_complete_within_timeout
  • New unit tests on the RouterExtension seam: with_router_extension_adds_reachable_route_to_make_router, make_router_returns_404_for_extension_route_when_none_attached

"Why did the noop handler go to therapy? Because it had to acknowledge every command without ever really doing anything." — Claude 🤖

@lucaspimentel lucaspimentel changed the title feat(bottlecap): add InvocationProcessorHandle::noop() for test-mode binary [APMSVLS-511] Add InvocationProcessorHandle::noop() for test-mode binary Apr 23, 2026
@lucaspimentel lucaspimentel changed the title [APMSVLS-511] Add InvocationProcessorHandle::noop() for test-mode binary [APMSVLS-511] Add InvocationProcessorHandle::noop() for test-mode binary Apr 23, 2026
@lucaspimentel lucaspimentel changed the title [APMSVLS-511] Add InvocationProcessorHandle::noop() for test-mode binary [APMSVLS-511] refactor(bottlecap): preparatory work for testmode binary Apr 23, 2026
@lucaspimentel lucaspimentel changed the title [APMSVLS-511] refactor(bottlecap): preparatory work for testmode binary [APMSVLS-501] refactor(bottlecap): preparatory work for testmode binary Apr 23, 2026
@lucaspimentel lucaspimentel marked this pull request as ready for review April 23, 2026 22:15
@lucaspimentel lucaspimentel requested a review from a team as a code owner April 23, 2026 22:15
Copilot AI review requested due to automatic review settings April 23, 2026 22:15
@lucaspimentel lucaspimentel requested a review from a team as a code owner April 23, 2026 22:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Preparatory refactors to enable an upcoming bottlecap-testmode binary to reuse the trace-processing pipeline (TraceAgent / handle_traces) without relying on the Lambda lifecycle, and to optionally expose a deterministic flush hook for test harnesses.

Changes:

  • Added InvocationProcessorHandle::noop() backed by a background task that acknowledges all ProcessorCommand variants with sensible defaults (plus unit tests).
  • Promoted start_trace_agent wiring from the Lambda binary into a reusable library module (bottlecap::traces::startup), including a build_trace_agent variant that returns an unspawned TraceAgent.
  • Added an opt-in POST /flush route on the trace-agent listener when a FlushingService is attached via TraceAgent::with_flushing_service.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
bottlecap/src/traces/trace_agent.rs Adds optional FlushingService plumbing and conditionally registers POST /flush.
bottlecap/src/traces/startup.rs Introduces shared library helpers to build/spawn the trace-agent pipeline.
bottlecap/src/traces/mod.rs Exposes the new startup module and re-exports pipeline helpers/types.
bottlecap/src/lifecycle/invocation/processor_service.rs Adds InvocationProcessorHandle::noop() plus tests to ensure defaults/non-blocking behavior.
bottlecap/src/bin/bottlecap/main.rs Switches Lambda binary to use the new library start_trace_agent helper and removes duplicated wiring.

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
Comment thread bottlecap/src/lifecycle/invocation/processor_service.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 723059ecf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bottlecap/src/traces/trace_agent.rs Outdated
Comment thread bottlecap/src/startup.rs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is binary specific code, not traces component related code, you could create a file/module in the bottlecap binary folder instead of here, those methods are supposed to be binary specific, this breaks that pattern

Copy link
Copy Markdown
Contributor

@duncanista duncanista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Appreciate the thorough PR description and the preparatory structure here. The noop() constructor and build_/start_trace_agent split are solid ideas. However, there are several Rust style and structural issues worth addressing before this merges, especially since this is establishing patterns that the test-mode binary will build on.


1. TraceAgentPipeline — 8-element tuple as public API

pub type TraceAgentPipeline = (
    Sender<SendDataBuilderInfo>,
    Arc<trace_flusher::TraceFlusher>,
    Arc<trace_processor::ServerlessTraceProcessor>,
    Arc<stats_flusher::StatsFlusher>,
    Arc<ProxyFlusher>,
    CancellationToken,
    StatsConcentratorHandle,
    TraceAggregatorHandle,
);

In Rust, tuples beyond 3-4 elements are a code smell. At every call site, someone destructures by position — if the order changes, or two fields happen to share a type, it silently breaks. Now that this is being promoted from a private function's return type to a pub type alias in the library, it should be a struct with named fields:

pub struct TraceAgentPipeline {
    pub trace_channel: Sender<SendDataBuilderInfo>,
    pub trace_flusher: Arc<trace_flusher::TraceFlusher>,
    pub trace_processor: Arc<trace_processor::ServerlessTraceProcessor>,
    // ...
}

Named fields are self-documenting, order-independent, and play nicely with IDE autocompletion. The original code had this problem, but promoting it to library API is the right time to fix it.


2. startup.rs doesn't belong in traces/

(Expanding on Jordan's inline comment.)

build_trace_agent imports from config, lifecycle, tags, traces, appsec, and flushing — it touches nearly every module in the crate. It's application-level wiring/orchestration, not a traces component. Placing it in traces::startup misrepresents its role and breaks the pattern where traces/ contains trace-domain code.

The existing pattern is that each binary owns its startup wiring in bin/<name>/main.rs. If sharing between the Lambda and test-mode binaries is truly needed, consider:

  • A top-level bottlecap::startup module (not nested under a domain module)
  • Or a shared helper in bin/ if both binaries live in the same crate

Re-exporting via pub use startup::{...} from traces::mod.rs further blurs the boundary — callers see bottlecap::traces::start_trace_agent and assume it's trace-domain API.


3. flush handler — over-engineered timeout logic

The current code (after accepting Copilot's suggestion):

Err(_) => {
    task.abort();
    match task.await {
        Err(join_err) if join_err.is_cancelled() => StatusCode::GATEWAY_TIMEOUT,
        Err(_) => StatusCode::INTERNAL_SERVER_ERROR,
        Ok(()) => StatusCode::GATEWAY_TIMEOUT,
    }
}

After task.abort(), the await will always resolve to Err(JoinError::Cancelled) (the Ok(()) arm is only reachable in a vanishingly small race window, and even then the right status is still 504). The inner match has 3 arms that all effectively return the same thing. Simplify to:

Err(_) => {
    task.abort();
    StatusCode::GATEWAY_TIMEOUT
}

We don't need to await after abort — we already know the status code. Simpler code is also more correct code because there are fewer branches to reason about.


4. Noop tests — reach through internals instead of using the public API

The third test (noop_platform_runtime_done_and_report_respond_without_blocking) is 50+ lines that manually construct ProcessorCommand::PlatformRuntimeDone and ProcessorCommand::PlatformReport, sending them through handle.sender directly. But InvocationProcessorHandle already has public methods for both:

  • handle.on_platform_runtime_done(...)
  • handle.on_platform_report(...)

The test should use those. Testing through the public API is more resilient to internal refactors (e.g., if the channel type changes) and is more readable — the reader immediately understands what behavior is being tested rather than having to mentally map command construction to handler behavior. It also avoids the 15+ lines of field construction for each command variant.


5. Doc comments leak consumer knowledge into library functions

build_trace_agent's doc references "bottlecap-testmode binary" and "POST /flush route" repeatedly. Library functions should document what they do and why a caller would use them, not who specifically calls them. Consumer-specific guidance belongs in the consumer's code. For example:

"Returns an unspawned TraceAgent so the caller can configure it (e.g., attach optional routes) before spawning."

is better than:

"notably the bottlecap-testmode binary, which uses it to back a POST /flush route"


6. flushing_service: Option<Arc<FlushingService>> on TraceAgent

Adding a test-only field to a production struct is pragmatic for one field, but worth flagging as a pattern to not repeat. Every TraceAgent in production carries a None that exists solely for test-mode. If more test-mode hooks are needed later, consider a trait or a router-extension pattern instead of accumulating Option fields.


Summary

Area Suggestion
TraceAgentPipeline Replace 8-tuple with a named struct
startup.rs location Move out of traces/, it's orchestration not trace-domain code
flush timeout Simplify the post-abort match to just return 504
Noop tests Use handle.on_platform_runtime_done() / on_platform_report() instead of raw channel sends
Doc comments Remove consumer-specific references from library function docs
flushing_service field Fine for now, but flag as not-to-repeat pattern

Happy to discuss any of these — the overall direction is good, these are about getting the Rust patterns right before the test-mode binary builds on top.

lucaspimentel added a commit that referenced this pull request Apr 24, 2026
Tuples beyond 3-4 fields are fragile: every call site destructures by
position, and any reordering (or two fields of the same type swapping)
breaks silently. Now that the pipeline is a pub API type shared between
the Lambda binary and future test-mode binary, promote it to a struct
with named pub fields.

Addresses review feedback on PR #1201.
lucaspimentel added a commit that referenced this pull request Apr 24, 2026
This module wires trace, stats, proxy, config, lifecycle, tags, appsec,
and flushing pieces together. It is cross-cutting orchestration, not
trace-domain code, so placing it under traces/ misrepresents its role
and makes callers treat bottlecap::traces::start_trace_agent as if it
were a trace-domain API.

Promote it to bottlecap::startup at the crate root so both the Lambda
binary and the forthcoming test-mode binary can consume it without
reaching into a domain module.

Addresses review feedback on PR #1201.
lucaspimentel added a commit that referenced this pull request Apr 24, 2026
After task.abort(), the awaited JoinError is effectively always
Cancelled, and all three arms of the post-abort match returned (or
should have returned) 504 for the test harness. The extra match adds
branches to reason about without changing the result, and the lone
Err(_) -> 500 arm is unreachable in practice (a panic surfaces through
the Ok(Err(_)) branch of the outer timeout). Drop the await-after-abort
and return 504 directly.

Addresses review feedback on PR #1201.
lucaspimentel added a commit that referenced this pull request Apr 24, 2026
The test previously built ProcessorCommand::PlatformRuntimeDone and
PlatformReport by hand and pushed them through handle.sender, making
the reader mentally map raw command construction to the handler
behavior and coupling the test to the channel internals. Use the
already-public InvocationProcessorHandle methods, which exercise the
same code path, drop ~15 lines of field plumbing per variant, and
stay resilient to future refactors of the command channel.

Addresses review feedback on PR #1201.
lucaspimentel added a commit that referenced this pull request Apr 24, 2026
Library doc comments named the bottlecap-testmode binary and the
POST /flush route it registers. Library APIs should document what they
do and why a caller might reach for them, not which specific caller
today happens to. Rewrite the affected doc comments (and one internal
impl comment) to describe behavior and motivation generically: "a
deterministic on-demand drain hook", "a handle that has no Lambda
lifecycle state to drive", and so on.

Addresses review feedback on PR #1201.
lucaspimentel added a commit that referenced this pull request Apr 24, 2026
…ension trait

The flushing_service: Option<Arc<FlushingService>> field was test-mode
scaffolding on the production struct. Every TraceAgent in production
carried a None that existed solely for a future test binary, and any
additional test-mode hook would have had to add another Option field
with the same shape, each pulling another consumer-specific dependency
into trace_agent.rs.

Replace it with a single generic extension seam:

- pub trait RouterExtension { fn extend(&self, router: Router) -> Router; }
- TraceAgent stores Option<Arc<dyn RouterExtension>> and exposes
  with_router_extension(self, ext) -> Self in place of
  with_flushing_service.
- make_router calls extension.extend(router) when set, before applying
  the outer fallback and body-limit layers.
- FLUSH_ENDPOINT_PATH, the flush handler, and the crate::flushing
  FlushingService import are removed from trace_agent.rs. The trace-
  agent module now knows nothing about FlushingService.

Future test-mode hooks add new methods (with defaults) to the trait or
new impls behind it, not new Option fields on TraceAgent.

A #[cfg(test)]-only SpyExtension validates the seam end-to-end: trait
shape compiles, state is carried via Arc, and merged routes are
reachable through the composed Router.

Addresses review feedback on PR #1201 (deferred comment 6).
@lucaspimentel
Copy link
Copy Markdown
Member Author

lucaspimentel commented Apr 24, 2026

@duncanista, all six items addressed across six commits on top of 605933a:

  1. TraceAgentPipeline tuple → struct with named pub fields (6586fc5). Destructure by name at the call site, so reordering or same-typed-field swaps can't break silently.

  2. startup.rs moved out of traces/ to a new top-level bottlecap::startup (80e7752). The Lambda binary calls bottlecap::startup::start_trace_agent now, no longer reaching into a domain module; the re-export from traces::mod.rs is gone.

  3. POST /flush timeout arm collapsed to task.abort(); 504 (b043eb8). The redundant await-after-abort and the unreachable 500 branch are gone; a panic still surfaces through the outer Ok(Err(_)) arm as 500.

  4. Noop test uses handle.on_platform_runtime_done / on_platform_report (962f394) instead of constructing ProcessorCommand variants by hand. About 15 lines of per-variant field plumbing removed.

  5. Consumer-specific references stripped from library doc comments (5fa5bb2). No more bottlecap-testmode or POST /flush route naming in build_trace_agent, with_router_extension, InvocationProcessorHandle::noop, etc. The library docs describe behavior, not today's caller.

  6. flushing_service field replaced with a RouterExtension trait (043d991, bc31322). Swapped the Option<Arc<FlushingService>> field for a single generic slot Option<Arc<dyn RouterExtension>> with pub trait RouterExtension { fn extend(&self, router: Router) -> Router; }. TraceAgent no longer imports or references FlushingService; the flush handler and FLUSH_ENDPOINT_PATH const are gone. Future test-mode hooks become new trait impls (or default-implemented methods added to the trait), not new Option fields. A cfg(test) test builds a real TraceAgent, attaches a spy extension via with_router_extension, and verifies the route is reachable through make_router; a second test confirms the None path still falls through to 404.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread bottlecap/src/startup.rs
Comment on lines +68 to +76
// Build one shared hyper-based HTTP client for trace and stats flushing.
// This client type is required by libdd_trace_utils for SendData::send().
let trace_http_client = trace_http_client::create_client(
config.proxy_https.as_ref(),
config.tls_cert_file.as_ref(),
config.skip_ssl_validation,
)
.expect("Failed to create trace HTTP client");

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_trace_agent is a public library API but it uses .expect("Failed to create trace HTTP client"), which will panic on common misconfigurations (invalid proxy URL/cert file, IO errors). Consider returning a Result<(TraceAgent, TraceAgentPipeline), Box<dyn Error + Send + Sync>> (and similarly making start_trace_agent return Result<TraceAgentPipeline, _>) so callers can handle startup failures without aborting the process.

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +152
/// Returns a handle backed by a background task that acknowledges every
/// command with a sensible default. Use this for callers that need an
/// `InvocationProcessorHandle` (for example, to reuse `TraceAgent` or
/// `handle_traces`) but have no Lambda lifecycle state to drive.
///
/// The match below is exhaustive: adding a new `ProcessorCommand` variant
/// will fail to compile here, so the noop behavior for it must be
/// decided explicitly. A response-carrying variant placed in the
/// fire-and-forget arm would silently drop its sender, causing the
/// caller to receive `ProcessorError::ChannelReceive` instead of the
/// intended default.
#[must_use]
pub fn noop() -> Self {
let (sender, mut receiver) = mpsc::channel::<ProcessorCommand>(1000);
tokio::spawn(async move {
while let Some(command) = receiver.recv().await {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InvocationProcessorHandle::noop() spawns a Tokio task internally. If this constructor is called outside of an active Tokio runtime, tokio::spawn will panic at runtime. Please either document this requirement explicitly in the doc comment (e.g., "must be called from within a Tokio runtime") or provide an alternative constructor that takes a tokio::runtime::Handle (or is async) to avoid surprising panics for library callers.

Copilot uses AI. Check for mistakes.
lucaspimentel and others added 5 commits April 24, 2026 19:34
The forthcoming bottlecap-testmode binary (APMSVLS-511) reuses TraceAgent
and handle_traces but has no Lambda lifecycle to drive. Adds a noop
constructor that spawns a background task acknowledging every
ProcessorCommand with a sensible default, so callers never block on
their response oneshots.

The match is exhaustive: a new ProcessorCommand variant forces a compile
error here, keeping test-mode behavior explicit rather than silently
dropping responses.
…isfy clippy

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Moves the start_trace_agent helper from the private bin/bottlecap/main.rs
into a new public module, bottlecap::traces::startup, so callers outside
the Lambda binary (notably the forthcoming bottlecap-testmode binary)
can construct the full trace-processing pipeline through a single call
instead of duplicating ~110 lines of wiring.

No behavior change. The Lambda binary's call site is unchanged: same
function name, same signature, same internal spawn.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…ode /flush hook

Adds an opt-in POST /flush route to the trace-agent listener on 8126,
backed by a caller-supplied FlushingService. The route is only
registered when the caller calls TraceAgent::with_flushing_service;
otherwise the Lambda binary's HTTP surface is unchanged.

- Add an optional flushing_service field to TraceAgent with a
  consuming builder method with_flushing_service(self, fs) -> Self.
- In make_router, conditionally merge a /flush sub-router that calls
  FlushingService::flush_blocking_final when the field is set.
- Split the library helper in src/traces/startup.rs into
  build_trace_agent (returns an unspawned TraceAgent plus the pipeline
  handles) and a thin start_trace_agent wrapper that spawns. Test-mode
  will use build_trace_agent so it can attach its FlushingService
  before spawning; Lambda's call site is unchanged.
- No behavior change for the Lambda binary: it never attaches a
  flushing service, so /flush remains unexposed and start_trace_agent
  still spawns internally.

Preparatory for APMSVLS-511 (bottlecap-testmode binary).

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
lucaspimentel and others added 23 commits April 24, 2026 19:34
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
build_trace_agent returns (TraceAgent, TraceAgentPipeline), a 2-tuple
whose complexity is hidden behind the type alias — the lint never fires.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Callers using build_trace_agent need the return type alias accessible
via the same crate::traces path as the function itself.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Wrap flush_blocking_final in a spawned task so panics return 500
instead of tearing down the handler, and add a 30-second timeout
so a stuck flush returns 504 rather than hanging the test harness.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Add a test that sends PlatformRuntimeDone and PlatformReport commands
directly through the noop handle's internal channel and verifies the
response oneshots are fulfilled without blocking.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Tuples beyond 3-4 fields are fragile: every call site destructures by
position, and any reordering (or two fields of the same type swapping)
breaks silently. Now that the pipeline is a pub API type shared between
the Lambda binary and future test-mode binary, promote it to a struct
with named pub fields.

Addresses review feedback on PR #1201.
This module wires trace, stats, proxy, config, lifecycle, tags, appsec,
and flushing pieces together. It is cross-cutting orchestration, not
trace-domain code, so placing it under traces/ misrepresents its role
and makes callers treat bottlecap::traces::start_trace_agent as if it
were a trace-domain API.

Promote it to bottlecap::startup at the crate root so both the Lambda
binary and the forthcoming test-mode binary can consume it without
reaching into a domain module.

Addresses review feedback on PR #1201.
After task.abort(), the awaited JoinError is effectively always
Cancelled, and all three arms of the post-abort match returned (or
should have returned) 504 for the test harness. The extra match adds
branches to reason about without changing the result, and the lone
Err(_) -> 500 arm is unreachable in practice (a panic surfaces through
the Ok(Err(_)) branch of the outer timeout). Drop the await-after-abort
and return 504 directly.

Addresses review feedback on PR #1201.
The test previously built ProcessorCommand::PlatformRuntimeDone and
PlatformReport by hand and pushed them through handle.sender, making
the reader mentally map raw command construction to the handler
behavior and coupling the test to the channel internals. Use the
already-public InvocationProcessorHandle methods, which exercise the
same code path, drop ~15 lines of field plumbing per variant, and
stay resilient to future refactors of the command channel.

Addresses review feedback on PR #1201.
Library doc comments named the bottlecap-testmode binary and the
POST /flush route it registers. Library APIs should document what they
do and why a caller might reach for them, not which specific caller
today happens to. Rewrite the affected doc comments (and one internal
impl comment) to describe behavior and motivation generically: "a
deterministic on-demand drain hook", "a handle that has no Lambda
lifecycle state to drive", and so on.

Addresses review feedback on PR #1201.
…ension trait

The flushing_service: Option<Arc<FlushingService>> field was test-mode
scaffolding on the production struct. Every TraceAgent in production
carried a None that existed solely for a future test binary, and any
additional test-mode hook would have had to add another Option field
with the same shape, each pulling another consumer-specific dependency
into trace_agent.rs.

Replace it with a single generic extension seam:

- pub trait RouterExtension { fn extend(&self, router: Router) -> Router; }
- TraceAgent stores Option<Arc<dyn RouterExtension>> and exposes
  with_router_extension(self, ext) -> Self in place of
  with_flushing_service.
- make_router calls extension.extend(router) when set, before applying
  the outer fallback and body-limit layers.
- FLUSH_ENDPOINT_PATH, the flush handler, and the crate::flushing
  FlushingService import are removed from trace_agent.rs. The trace-
  agent module now knows nothing about FlushingService.

Future test-mode hooks add new methods (with defaults) to the trait or
new impls behind it, not new Option fields on TraceAgent.

A #[cfg(test)]-only SpyExtension validates the seam end-to-end: trait
shape compiles, state is carried via Arc, and merged routes are
reachable through the composed Router.

Addresses review feedback on PR #1201 (deferred comment 6).
The previous SpyExtension test called `extension.extend(Router::new())`
directly, which only proves axum's Router::merge works. Build an actual
TraceAgent via a small test helper, attach the spy via
with_router_extension, and hit /spy through the full router returned by
make_router. Add a second test confirming that with no extension
attached, /spy falls through to the 404 handler.

Also trim the "deterministic drain endpoint" example from the
with_router_extension and start_trace_agent doc blocks; keep it only
in the trait doc to avoid repetition across three locations.
…comment

Dropping a oneshot sender causes the receiver to return
ProcessorError::ChannelReceive, not a hang. Update the comment to
describe the actual observable behavior.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…omment

The previous note mentioned only the three tasks spawned directly in
build_trace_agent (aggregator, concentrator, dedup). TraceAgent::new
unconditionally spawns a fourth task: the trace-payload drain loop.
Update the summary line and the leak warning to reflect all four tasks.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…rrors

Previously, a panic or error inside extend() would propagate as a panic
through make_router() and start(), then be silently dropped by the
tokio::spawn wrapper in start_trace_agent (which only logged Err returns,
not panics).

Changing the return type to Result<Router, Box<dyn std::error::Error>>
means implementors must return errors explicitly. Those errors propagate
through make_router via ?, then through start() to the spawn wrapper,
where they are already logged.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
… timeout

The exhaustive match in noop() prevents unhandled variants at compile time,
but it cannot prevent a future contributor from placing a response-carrying
variant in the fire-and-forget arm. In that case, rx.await returns
ProcessorError::ChannelReceive, and the test would hang rather than fail
cleanly.

Add a test that wraps every request-response call in a 500ms timeout so
any such regression produces a clear failure instead of a hung test suite.

🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
@lucaspimentel lucaspimentel force-pushed the lpimentel/bottlecap-test-mode branch from 05677cb to c825e8d Compare April 24, 2026 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants