Skip to content

apollo_http_server: bound in-flight add_tx requests with a semaphore to shed flood load#14566

Open
dan-starkware wants to merge 1 commit into
mainfrom
06-18-apollo_http_server_bound_in-flight_add_tx_requests_with_a_semaphore_to_shed_flood_load
Open

apollo_http_server: bound in-flight add_tx requests with a semaphore to shed flood load#14566
dan-starkware wants to merge 1 commit into
mainfrom
06-18-apollo_http_server_bound_in-flight_add_tx_requests_with_a_semaphore_to_shed_flood_load

Conversation

@dan-starkware

Copy link
Copy Markdown
Collaborator

No description provided.

@reviewable-StarkWare

Copy link
Copy Markdown

This change is Reviewable

dan-starkware commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator Author

@dan-starkware dan-starkware force-pushed the 06-18-apollo_http_server_bound_in-flight_add_tx_requests_with_a_semaphore_to_shed_flood_load branch from 934717f to 26cb44c Compare June 18, 2026 14:50
@dan-starkware dan-starkware marked this pull request as ready for review June 21, 2026 07:19

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26cb44c730

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

.add_tx_semaphore
.clone()
.try_acquire_owned()
.map_err(|_| HttpServerError::AtCapacityError())?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count capacity rejections as failed transactions

When the semaphore is exhausted, this returns AtCapacityError after the handlers have already incremented ADDED_TRANSACTIONS_TOTAL, but it bypasses record_added_transactions / increment_failure_metrics. During flood shedding, every 503 capacity rejection is therefore counted as received but not as failed, which makes the HTTP failure-rate alert and success/failure dashboards underreport exactly the overload condition this path is meant to expose.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be recorded to the failure metric

Comment on lines +316 to +320
let permit = app_state
.add_tx_semaphore
.clone()
.try_acquire_owned()
.map_err(|_| HttpServerError::AtCapacityError())?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move the capacity check before parsing request bodies

Because the permit is acquired only inside add_tx_inner, requests reaching this path have already paid the expensive body work: the RPC endpoint's Json(tx) extractor has deserialized the transaction, and the deprecated endpoint has read the full String, parsed it, and converted it before calling here. Under overload with 128 gateway calls in flight, additional max-size or compressed requests are still read and parsed before being rejected with 503, so this does not actually bound in-flight HTTP add_tx payload work during the flood scenario this change is intended to shed.

Useful? React with 👍 / 👎.

// Bounds the number of in-flight add_tx requests, each of which holds its payload and detached
// gateway task for the request's full lifetime. Aligned with the gateway component server's own
// concurrency bound. Excess requests are rejected with 503 rather than queued.
const DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTS: usize = 128;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use configured gateway concurrency for capacity

This hard-codes the HTTP add_tx limit to the default gateway concurrency instead of the configured gateway server/client limit. In deployments where components.gateway.max_concurrency or components.gateway.local_server_config.max_concurrency is tuned away from 128, the HTTP server either sheds valid load too early or still sends more concurrent gateway requests than the gateway can process, defeating the stated alignment with the gateway component's own bound.

Useful? React with 👍 / 👎.

@matanl-starkware matanl-starkware left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matanl-starkware reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on dan-starkware).


crates/apollo_http_server/src/http_server.rs line 108 at r1 (raw file):

        let (dynamic_config_tx, dynamic_config_rx) =
            channel::<HttpServerDynamicConfig>(config.dynamic_config.clone());
        let add_tx_semaphore = Arc::new(Semaphore::new(DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTS));

Should be taken from the config (components.gateway.max_concurrency)

Code quote:

DEFAULT_MAX_CONCURRENT_ADD_TX_REQUESTS

crates/apollo_http_server/src/http_server.rs line 315 at r1 (raw file):

    let gateway_input: GatewayInput = GatewayInput { rpc_tx: tx, message_metadata: None };
    // Bound concurrent in-flight requests: acquire the permit before spawning so excess requests
    // are rejected with 503 rather than piling up as detached tasks under flood.

Product (Ohad) specifically asked to avoid returning 503 for add_tx.

Code quote:

rejected with 503

.add_tx_semaphore
.clone()
.try_acquire_owned()
.map_err(|_| HttpServerError::AtCapacityError())?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be recorded to the failure metric

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants