apollo_http_server: bound in-flight add_tx requests with a semaphore to shed flood load#14566
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
…to shed flood load
934717f to
26cb44c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26cb44c730
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .add_tx_semaphore | ||
| .clone() | ||
| .try_acquire_owned() | ||
| .map_err(|_| HttpServerError::AtCapacityError())?; |
There was a problem hiding this comment.
Count capacity rejections as failed transactions
When the semaphore is exhausted, this returns AtCapacityError after the handlers have already incremented ADDED_TRANSACTIONS_TOTAL, but it bypasses record_added_transactions / increment_failure_metrics. During flood shedding, every 503 capacity rejection is therefore counted as received but not as failed, which makes the HTTP failure-rate alert and success/failure dashboards underreport exactly the overload condition this path is meant to expose.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Should be recorded to the failure metric
| let permit = app_state | ||
| .add_tx_semaphore | ||
| .clone() | ||
| .try_acquire_owned() | ||
| .map_err(|_| HttpServerError::AtCapacityError())?; |
There was a problem hiding this comment.
Move the capacity check before parsing request bodies
Because the permit is acquired only inside add_tx_inner, requests reaching this path have already paid the expensive body work: the RPC endpoint's Json(tx) extractor has deserialized the transaction, and the deprecated endpoint has read the full String, parsed it, and converted it before calling here. Under overload with 128 gateway calls in flight, additional max-size or compressed requests are still read and parsed before being rejected with 503, so this does not actually bound in-flight HTTP add_tx payload work during the flood scenario this change is intended to shed.
Useful? React with 👍 / 👎.
| // Bounds the number of in-flight add_tx requests, each of which holds its payload and detached | ||
| // gateway task for the request's full lifetime. Aligned with the gateway component server's own | ||
| // concurrency bound. Excess requests are rejected with 503 rather than queued. | ||
| const DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTS: usize = 128; |
There was a problem hiding this comment.
Use configured gateway concurrency for capacity
This hard-codes the HTTP add_tx limit to the default gateway concurrency instead of the configured gateway server/client limit. In deployments where components.gateway.max_concurrency or components.gateway.local_server_config.max_concurrency is tuned away from 128, the HTTP server either sheds valid load too early or still sends more concurrent gateway requests than the gateway can process, defeating the stated alignment with the gateway component's own bound.
Useful? React with 👍 / 👎.
matanl-starkware
left a comment
There was a problem hiding this comment.
@matanl-starkware reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on dan-starkware).
crates/apollo_http_server/src/http_server.rs line 108 at r1 (raw file):
let (dynamic_config_tx, dynamic_config_rx) = channel::<HttpServerDynamicConfig>(config.dynamic_config.clone()); let add_tx_semaphore = Arc::new(Semaphore::new(DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTS));
Should be taken from the config (components.gateway.max_concurrency)
Code quote:
DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTScrates/apollo_http_server/src/http_server.rs line 315 at r1 (raw file):
let gateway_input: GatewayInput = GatewayInput { rpc_tx: tx, message_metadata: None }; // Bound concurrent in-flight requests: acquire the permit before spawning so excess requests // are rejected with 503 rather than piling up as detached tasks under flood.
Product (Ohad) specifically asked to avoid returning 503 for add_tx.
Code quote:
rejected with 503| .add_tx_semaphore | ||
| .clone() | ||
| .try_acquire_owned() | ||
| .map_err(|_| HttpServerError::AtCapacityError())?; |
There was a problem hiding this comment.
Should be recorded to the failure metric

No description provided.