[APMSVLS-501] refactor(bottlecap): preparatory work for testmode binary#1201
[APMSVLS-501] refactor(bottlecap): preparatory work for testmode binary#1201lucaspimentel wants to merge 28 commits intomainfrom
Conversation
InvocationProcessorHandle::noop() for test-mode binary
InvocationProcessorHandle::noop() for test-mode binaryThere was a problem hiding this comment.
Pull request overview
Preparatory refactors to enable an upcoming bottlecap-testmode binary to reuse the trace-processing pipeline (TraceAgent / handle_traces) without relying on the Lambda lifecycle, and to optionally expose a deterministic flush hook for test harnesses.
Changes:
- Added
InvocationProcessorHandle::noop()backed by a background task that acknowledges allProcessorCommandvariants with sensible defaults (plus unit tests). - Promoted
start_trace_agentwiring from the Lambda binary into a reusable library module (bottlecap::traces::startup), including abuild_trace_agentvariant that returns an unspawnedTraceAgent. - Added an opt-in
POST /flushroute on the trace-agent listener when aFlushingServiceis attached viaTraceAgent::with_flushing_service.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| bottlecap/src/traces/trace_agent.rs | Adds optional FlushingService plumbing and conditionally registers POST /flush. |
| bottlecap/src/traces/startup.rs | Introduces shared library helpers to build/spawn the trace-agent pipeline. |
| bottlecap/src/traces/mod.rs | Exposes the new startup module and re-exports pipeline helpers/types. |
| bottlecap/src/lifecycle/invocation/processor_service.rs | Adds InvocationProcessorHandle::noop() plus tests to ensure defaults/non-blocking behavior. |
| bottlecap/src/bin/bottlecap/main.rs | Switches Lambda binary to use the new library start_trace_agent helper and removes duplicated wiring. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 723059ecf6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
this is binary specific code, not traces component related code, you could create a file/module in the bottlecap binary folder instead of here, those methods are supposed to be binary specific, this breaks that pattern
duncanista
left a comment
There was a problem hiding this comment.
Review
Appreciate the thorough PR description and the preparatory structure here. The noop() constructor and build_/start_trace_agent split are solid ideas. However, there are several Rust style and structural issues worth addressing before this merges, especially since this is establishing patterns that the test-mode binary will build on.
1. TraceAgentPipeline — 8-element tuple as public API
pub type TraceAgentPipeline = (
Sender<SendDataBuilderInfo>,
Arc<trace_flusher::TraceFlusher>,
Arc<trace_processor::ServerlessTraceProcessor>,
Arc<stats_flusher::StatsFlusher>,
Arc<ProxyFlusher>,
CancellationToken,
StatsConcentratorHandle,
TraceAggregatorHandle,
);In Rust, tuples beyond 3-4 elements are a code smell. At every call site, someone destructures by position — if the order changes, or two fields happen to share a type, it silently breaks. Now that this is being promoted from a private function's return type to a pub type alias in the library, it should be a struct with named fields:
pub struct TraceAgentPipeline {
pub trace_channel: Sender<SendDataBuilderInfo>,
pub trace_flusher: Arc<trace_flusher::TraceFlusher>,
pub trace_processor: Arc<trace_processor::ServerlessTraceProcessor>,
// ...
}Named fields are self-documenting, order-independent, and play nicely with IDE autocompletion. The original code had this problem, but promoting it to library API is the right time to fix it.
2. startup.rs doesn't belong in traces/
(Expanding on Jordan's inline comment.)
build_trace_agent imports from config, lifecycle, tags, traces, appsec, and flushing — it touches nearly every module in the crate. It's application-level wiring/orchestration, not a traces component. Placing it in traces::startup misrepresents its role and breaks the pattern where traces/ contains trace-domain code.
The existing pattern is that each binary owns its startup wiring in bin/<name>/main.rs. If sharing between the Lambda and test-mode binaries is truly needed, consider:
- A top-level
bottlecap::startupmodule (not nested under a domain module) - Or a shared helper in
bin/if both binaries live in the same crate
Re-exporting via pub use startup::{...} from traces::mod.rs further blurs the boundary — callers see bottlecap::traces::start_trace_agent and assume it's trace-domain API.
3. flush handler — over-engineered timeout logic
The current code (after accepting Copilot's suggestion):
Err(_) => {
task.abort();
match task.await {
Err(join_err) if join_err.is_cancelled() => StatusCode::GATEWAY_TIMEOUT,
Err(_) => StatusCode::INTERNAL_SERVER_ERROR,
Ok(()) => StatusCode::GATEWAY_TIMEOUT,
}
}After task.abort(), the await will always resolve to Err(JoinError::Cancelled) (the Ok(()) arm is only reachable in a vanishingly small race window, and even then the right status is still 504). The inner match has 3 arms that all effectively return the same thing. Simplify to:
Err(_) => {
task.abort();
StatusCode::GATEWAY_TIMEOUT
}We don't need to await after abort — we already know the status code. Simpler code is also more correct code because there are fewer branches to reason about.
4. Noop tests — reach through internals instead of using the public API
The third test (noop_platform_runtime_done_and_report_respond_without_blocking) is 50+ lines that manually construct ProcessorCommand::PlatformRuntimeDone and ProcessorCommand::PlatformReport, sending them through handle.sender directly. But InvocationProcessorHandle already has public methods for both:
handle.on_platform_runtime_done(...)handle.on_platform_report(...)
The test should use those. Testing through the public API is more resilient to internal refactors (e.g., if the channel type changes) and is more readable — the reader immediately understands what behavior is being tested rather than having to mentally map command construction to handler behavior. It also avoids the 15+ lines of field construction for each command variant.
5. Doc comments leak consumer knowledge into library functions
build_trace_agent's doc references "bottlecap-testmode binary" and "POST /flush route" repeatedly. Library functions should document what they do and why a caller would use them, not who specifically calls them. Consumer-specific guidance belongs in the consumer's code. For example:
"Returns an unspawned
TraceAgentso the caller can configure it (e.g., attach optional routes) before spawning."
is better than:
"notably the bottlecap-testmode binary, which uses it to back a POST /flush route"
6. flushing_service: Option<Arc<FlushingService>> on TraceAgent
Adding a test-only field to a production struct is pragmatic for one field, but worth flagging as a pattern to not repeat. Every TraceAgent in production carries a None that exists solely for test-mode. If more test-mode hooks are needed later, consider a trait or a router-extension pattern instead of accumulating Option fields.
Summary
| Area | Suggestion |
|---|---|
TraceAgentPipeline |
Replace 8-tuple with a named struct |
startup.rs location |
Move out of traces/, it's orchestration not trace-domain code |
flush timeout |
Simplify the post-abort match to just return 504 |
| Noop tests | Use handle.on_platform_runtime_done() / on_platform_report() instead of raw channel sends |
| Doc comments | Remove consumer-specific references from library function docs |
flushing_service field |
Fine for now, but flag as not-to-repeat pattern |
Happy to discuss any of these — the overall direction is good, these are about getting the Rust patterns right before the test-mode binary builds on top.
Tuples beyond 3-4 fields are fragile: every call site destructures by position, and any reordering (or two fields of the same type swapping) breaks silently. Now that the pipeline is a pub API type shared between the Lambda binary and future test-mode binary, promote it to a struct with named pub fields. Addresses review feedback on PR #1201.
This module wires trace, stats, proxy, config, lifecycle, tags, appsec, and flushing pieces together. It is cross-cutting orchestration, not trace-domain code, so placing it under traces/ misrepresents its role and makes callers treat bottlecap::traces::start_trace_agent as if it were a trace-domain API. Promote it to bottlecap::startup at the crate root so both the Lambda binary and the forthcoming test-mode binary can consume it without reaching into a domain module. Addresses review feedback on PR #1201.
After task.abort(), the awaited JoinError is effectively always Cancelled, and all three arms of the post-abort match returned (or should have returned) 504 for the test harness. The extra match adds branches to reason about without changing the result, and the lone Err(_) -> 500 arm is unreachable in practice (a panic surfaces through the Ok(Err(_)) branch of the outer timeout). Drop the await-after-abort and return 504 directly. Addresses review feedback on PR #1201.
The test previously built ProcessorCommand::PlatformRuntimeDone and PlatformReport by hand and pushed them through handle.sender, making the reader mentally map raw command construction to the handler behavior and coupling the test to the channel internals. Use the already-public InvocationProcessorHandle methods, which exercise the same code path, drop ~15 lines of field plumbing per variant, and stay resilient to future refactors of the command channel. Addresses review feedback on PR #1201.
Library doc comments named the bottlecap-testmode binary and the POST /flush route it registers. Library APIs should document what they do and why a caller might reach for them, not which specific caller today happens to. Rewrite the affected doc comments (and one internal impl comment) to describe behavior and motivation generically: "a deterministic on-demand drain hook", "a handle that has no Lambda lifecycle state to drive", and so on. Addresses review feedback on PR #1201.
…ension trait
The flushing_service: Option<Arc<FlushingService>> field was test-mode
scaffolding on the production struct. Every TraceAgent in production
carried a None that existed solely for a future test binary, and any
additional test-mode hook would have had to add another Option field
with the same shape, each pulling another consumer-specific dependency
into trace_agent.rs.
Replace it with a single generic extension seam:
- pub trait RouterExtension { fn extend(&self, router: Router) -> Router; }
- TraceAgent stores Option<Arc<dyn RouterExtension>> and exposes
with_router_extension(self, ext) -> Self in place of
with_flushing_service.
- make_router calls extension.extend(router) when set, before applying
the outer fallback and body-limit layers.
- FLUSH_ENDPOINT_PATH, the flush handler, and the crate::flushing
FlushingService import are removed from trace_agent.rs. The trace-
agent module now knows nothing about FlushingService.
Future test-mode hooks add new methods (with defaults) to the trait or
new impls behind it, not new Option fields on TraceAgent.
A #[cfg(test)]-only SpyExtension validates the seam end-to-end: trait
shape compiles, state is carried via Arc, and merged routes are
reachable through the composed Router.
Addresses review feedback on PR #1201 (deferred comment 6).
|
@duncanista, all six items addressed across six commits on top of
|
| // Build one shared hyper-based HTTP client for trace and stats flushing. | ||
| // This client type is required by libdd_trace_utils for SendData::send(). | ||
| let trace_http_client = trace_http_client::create_client( | ||
| config.proxy_https.as_ref(), | ||
| config.tls_cert_file.as_ref(), | ||
| config.skip_ssl_validation, | ||
| ) | ||
| .expect("Failed to create trace HTTP client"); | ||
|
|
There was a problem hiding this comment.
build_trace_agent is a public library API but it uses .expect("Failed to create trace HTTP client"), which will panic on common misconfigurations (invalid proxy URL/cert file, IO errors). Consider returning a Result<(TraceAgent, TraceAgentPipeline), Box<dyn Error + Send + Sync>> (and similarly making start_trace_agent return Result<TraceAgentPipeline, _>) so callers can handle startup failures without aborting the process.
| /// Returns a handle backed by a background task that acknowledges every | ||
| /// command with a sensible default. Use this for callers that need an | ||
| /// `InvocationProcessorHandle` (for example, to reuse `TraceAgent` or | ||
| /// `handle_traces`) but have no Lambda lifecycle state to drive. | ||
| /// | ||
| /// The match below is exhaustive: adding a new `ProcessorCommand` variant | ||
| /// will fail to compile here, so the noop behavior for it must be | ||
| /// decided explicitly. A response-carrying variant placed in the | ||
| /// fire-and-forget arm would silently drop its sender, causing the | ||
| /// caller to receive `ProcessorError::ChannelReceive` instead of the | ||
| /// intended default. | ||
| #[must_use] | ||
| pub fn noop() -> Self { | ||
| let (sender, mut receiver) = mpsc::channel::<ProcessorCommand>(1000); | ||
| tokio::spawn(async move { | ||
| while let Some(command) = receiver.recv().await { |
There was a problem hiding this comment.
InvocationProcessorHandle::noop() spawns a Tokio task internally. If this constructor is called outside of an active Tokio runtime, tokio::spawn will panic at runtime. Please either document this requirement explicitly in the doc comment (e.g., "must be called from within a Tokio runtime") or provide an alternative constructor that takes a tokio::runtime::Handle (or is async) to avoid surprising panics for library callers.
The forthcoming bottlecap-testmode binary (APMSVLS-511) reuses TraceAgent and handle_traces but has no Lambda lifecycle to drive. Adds a noop constructor that spawns a background task acknowledging every ProcessorCommand with a sensible default, so callers never block on their response oneshots. The match is exhaustive: a new ProcessorCommand variant forces a compile error here, keeping test-mode behavior explicit rather than silently dropping responses.
…isfy clippy 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Moves the start_trace_agent helper from the private bin/bottlecap/main.rs into a new public module, bottlecap::traces::startup, so callers outside the Lambda binary (notably the forthcoming bottlecap-testmode binary) can construct the full trace-processing pipeline through a single call instead of duplicating ~110 lines of wiring. No behavior change. The Lambda binary's call site is unchanged: same function name, same signature, same internal spawn. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…ode /flush hook Adds an opt-in POST /flush route to the trace-agent listener on 8126, backed by a caller-supplied FlushingService. The route is only registered when the caller calls TraceAgent::with_flushing_service; otherwise the Lambda binary's HTTP surface is unchanged. - Add an optional flushing_service field to TraceAgent with a consuming builder method with_flushing_service(self, fs) -> Self. - In make_router, conditionally merge a /flush sub-router that calls FlushingService::flush_blocking_final when the field is set. - Split the library helper in src/traces/startup.rs into build_trace_agent (returns an unspawned TraceAgent plus the pipeline handles) and a thin start_trace_agent wrapper that spawns. Test-mode will use build_trace_agent so it can attach its FlushingService before spawning; Lambda's call site is unchanged. - No behavior change for the Lambda binary: it never attaches a flushing service, so /flush remains unexposed and start_trace_agent still spawns internally. Preparatory for APMSVLS-511 (bottlecap-testmode binary). 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
build_trace_agent returns (TraceAgent, TraceAgentPipeline), a 2-tuple whose complexity is hidden behind the type alias — the lint never fires. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Callers using build_trace_agent need the return type alias accessible via the same crate::traces path as the function itself. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Wrap flush_blocking_final in a spawned task so panics return 500 instead of tearing down the handler, and add a 30-second timeout so a stuck flush returns 504 rather than hanging the test harness. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Add a test that sends PlatformRuntimeDone and PlatformReport commands directly through the noop handle's internal channel and verifies the response oneshots are fulfilled without blocking. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
Tuples beyond 3-4 fields are fragile: every call site destructures by position, and any reordering (or two fields of the same type swapping) breaks silently. Now that the pipeline is a pub API type shared between the Lambda binary and future test-mode binary, promote it to a struct with named pub fields. Addresses review feedback on PR #1201.
This module wires trace, stats, proxy, config, lifecycle, tags, appsec, and flushing pieces together. It is cross-cutting orchestration, not trace-domain code, so placing it under traces/ misrepresents its role and makes callers treat bottlecap::traces::start_trace_agent as if it were a trace-domain API. Promote it to bottlecap::startup at the crate root so both the Lambda binary and the forthcoming test-mode binary can consume it without reaching into a domain module. Addresses review feedback on PR #1201.
After task.abort(), the awaited JoinError is effectively always Cancelled, and all three arms of the post-abort match returned (or should have returned) 504 for the test harness. The extra match adds branches to reason about without changing the result, and the lone Err(_) -> 500 arm is unreachable in practice (a panic surfaces through the Ok(Err(_)) branch of the outer timeout). Drop the await-after-abort and return 504 directly. Addresses review feedback on PR #1201.
The test previously built ProcessorCommand::PlatformRuntimeDone and PlatformReport by hand and pushed them through handle.sender, making the reader mentally map raw command construction to the handler behavior and coupling the test to the channel internals. Use the already-public InvocationProcessorHandle methods, which exercise the same code path, drop ~15 lines of field plumbing per variant, and stay resilient to future refactors of the command channel. Addresses review feedback on PR #1201.
Library doc comments named the bottlecap-testmode binary and the POST /flush route it registers. Library APIs should document what they do and why a caller might reach for them, not which specific caller today happens to. Rewrite the affected doc comments (and one internal impl comment) to describe behavior and motivation generically: "a deterministic on-demand drain hook", "a handle that has no Lambda lifecycle state to drive", and so on. Addresses review feedback on PR #1201.
…ension trait
The flushing_service: Option<Arc<FlushingService>> field was test-mode
scaffolding on the production struct. Every TraceAgent in production
carried a None that existed solely for a future test binary, and any
additional test-mode hook would have had to add another Option field
with the same shape, each pulling another consumer-specific dependency
into trace_agent.rs.
Replace it with a single generic extension seam:
- pub trait RouterExtension { fn extend(&self, router: Router) -> Router; }
- TraceAgent stores Option<Arc<dyn RouterExtension>> and exposes
with_router_extension(self, ext) -> Self in place of
with_flushing_service.
- make_router calls extension.extend(router) when set, before applying
the outer fallback and body-limit layers.
- FLUSH_ENDPOINT_PATH, the flush handler, and the crate::flushing
FlushingService import are removed from trace_agent.rs. The trace-
agent module now knows nothing about FlushingService.
Future test-mode hooks add new methods (with defaults) to the trait or
new impls behind it, not new Option fields on TraceAgent.
A #[cfg(test)]-only SpyExtension validates the seam end-to-end: trait
shape compiles, state is carried via Arc, and merged routes are
reachable through the composed Router.
Addresses review feedback on PR #1201 (deferred comment 6).
The previous SpyExtension test called `extension.extend(Router::new())` directly, which only proves axum's Router::merge works. Build an actual TraceAgent via a small test helper, attach the spy via with_router_extension, and hit /spy through the full router returned by make_router. Add a second test confirming that with no extension attached, /spy falls through to the 404 handler. Also trim the "deterministic drain endpoint" example from the with_router_extension and start_trace_agent doc blocks; keep it only in the trait doc to avoid repetition across three locations.
…comment Dropping a oneshot sender causes the receiver to return ProcessorError::ChannelReceive, not a hang. Update the comment to describe the actual observable behavior. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…omment The previous note mentioned only the three tasks spawned directly in build_trace_agent (aggregator, concentrator, dedup). TraceAgent::new unconditionally spawns a fourth task: the trace-payload drain loop. Update the summary line and the leak warning to reflect all four tasks. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
…rrors Previously, a panic or error inside extend() would propagate as a panic through make_router() and start(), then be silently dropped by the tokio::spawn wrapper in start_trace_agent (which only logged Err returns, not panics). Changing the return type to Result<Router, Box<dyn std::error::Error>> means implementors must return errors explicitly. Those errors propagate through make_router via ?, then through start() to the spawn wrapper, where they are already logged. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
… timeout The exhaustive match in noop() prevents unhandled variants at compile time, but it cannot prevent a future contributor from placing a response-carrying variant in the fire-and-forget arm. In that case, rx.await returns ProcessorError::ChannelReceive, and the test would hang rather than fail cleanly. Add a test that wraps every request-response call in a 500ms timeout so any such regression produces a clear failure instead of a hung test suite. 🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
🤖 Co-Authored-By: Claude Code <noreply@anthropic.com>
05677cb to
c825e8d
Compare
Overview
Preparatory refactors for the upcoming
bottlecap-testmodebinary (APMSVLS-501), which reusesTraceAgent/handle_tracesbut has no Lambda lifecycle to drive. Three logical changes:1.
InvocationProcessorHandle::noop()Constructor backed by a background task that acknowledges every
ProcessorCommandwith a sensible default so callers never block on their response oneshots:GetReparentingInfo,UpdateReparenting,SetColdStartSpanTraceId,PlatformRuntimeDone,PlatformReport) reply with empty/default values.The
matchis exhaustive: adding a newProcessorCommandvariant will cause a compile error here, forcing test-mode behavior to be decided explicitly. A response-carrying variant placed in the fire-and-forget arm would silently drop its sender, causing the caller to receiveProcessorError::ChannelReceiveinstead of the intended default.The noop channel capacity matches the real
InvocationProcessorServiceto avoid backpressure surprises.2. Promote
start_trace_agentto a top-level library moduleMoved the
start_trace_agenthelper from the privatebin/bottlecap/main.rsinto a new public module,bottlecap::startup, so callers outside the Lambda binary can construct the full trace-processing pipeline through a single call instead of duplicating ~110 lines of wiring. Placed at the crate root (not undertraces/) because the module wires trace, stats, proxy, config, lifecycle, tags, appsec, and flushing pieces together: it is cross-cutting orchestration, not a trace-domain API. No behavior change; the Lambda binary's call site is unchanged (same function name, same signature, same internal spawn).The companion
TraceAgentPipelinereturn type is apub structwith namedpubfields (trace/stats/proxy flushers, the trace-channel sender, the shutdown token, and the aggregator/concentrator handles), rather than a positional tuple.3.
RouterExtensionseam onTraceAgentAdds a generic extension point that lets a caller merge additional axum routes into the trace-agent's HTTP router without
TraceAgentknowing what those routes do. When no caller attaches an extension (the Lambda binary case), the HTTP surface on port 8126 is unchanged.pub trait RouterExtension: Send + Sync { fn extend(&self, router: Router) -> Result<Router, Box<dyn Error>>; }defined intraces::trace_agent. ReturningErraborts agent startup and surfaces the error in the existing log path, preventing silent failures from a panicking or misconfigured extension.TraceAgentholdsrouter_extension: Option<Arc<dyn RouterExtension>>with a consuming builder methodwith_router_extension(self, ext: Arc<dyn RouterExtension>) -> Self.make_routercallsextension.extend(router)?when the field is set, before applying the outer fallback and body-limit layers.startup.rsintobuild_trace_agent(returns an unspawnedTraceAgentplus theTraceAgentPipelinehandles) and a thinstart_trace_agentwrapper that spawns. Callers that need to attach aRouterExtensionbefore startup usebuild_trace_agent; the Lambda binary's call site continues to usestart_trace_agent.#[cfg(test)]-onlySpyExtensionvalidates the seam end-to-end: the trait shape compiles, state is carried viaArc, and merged routes are reachable through the composedRouterreturned bymake_router.Any concrete flush/drain/diagnostic endpoint lives behind an impl of this trait in the consumer crate, not in
trace_agent.rs.Testing
cargo check --bin bottlecapcargo check --libcargo test --libcargo clippy --workspace --all-targets --features default -- -D warningscargo fmt --all -- --checkcargo nextest run --workspace(532/532 passed)InvocationProcessorHandle::noop():noop_request_response_methods_return_defaults,noop_fire_and_forget_commands_do_not_panic,noop_platform_runtime_done_and_report_respond_without_blocking,noop_request_response_variants_complete_within_timeoutRouterExtensionseam:with_router_extension_adds_reachable_route_to_make_router,make_router_returns_404_for_extension_route_when_none_attached