Skip to content

apollo_l1_events: reuse running sync task on catch-up restart by raising its shared target height#14518

Merged
ShahakShama merged 1 commit into
main-v0.14.3from
audit-H14
Jun 21, 2026
Merged

apollo_l1_events: reuse running sync task on catch-up restart by raising its shared target height#14518
ShahakShama merged 1 commit into
main-v0.14.3from
audit-H14

Conversation

@ShahakShama

@ShahakShama ShahakShama commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Goal: Fix audit finding H-14 — unbounded task leak in the L1EventsProvider
catch-up flow.

Change summary:

  • On catch-up restart, reuse the in-flight l2_sync_task instead of recreating the
    Catchupper and spawning a second one. target_height is now an Arc
    shared with the running task; start_catching_up raises it via
    update_target_height when a task is already running, and the task re-reads it
    each iteration so it keeps syncing up to the new target.
  • Dropping a tokio JoinHandle does not cancel the task, so before this every
    catch-up restart leaked a still-running sync task that races to commit the same
    blocks and can spin forever in its retry loop (DoS).
  • Add regression test restarting_catch_up_raises_target_without_spawning_second_sync_task.

Decision points:

  • Keep the running task and raise its shared target rather than abort + respawn:
    aborting a task parked on a client call is not cancel-safe (the commit may apply
    while its result is never observed). A single long-lived task that tracks the
    latest target avoids both the leak and the cancel-safety hazard.
  • Atomic orderings: Release on the writer (update_target_height), Acquire on the
    readers, since there is a single writer (provider) and single reader (sync task).

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

@reviewable-StarkWare

Copy link
Copy Markdown

This change is Reviewable

@cursor

cursor Bot commented Jun 16, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches startup/catch-up concurrency and L2 commit_block delivery; the change is localized and guarded by a regression test, but mistaken reuse or atomic ordering could still affect provider height during node bootstrap.

Overview
Fixes unbounded tokio task leak when catch-up is restarted while an L2 sync task is still running (audit H-14). Previously start_catching_up always called reset_catchupper and spawned a new l2_sync_task; dropping the old JoinHandle left the prior task alive, so multiple tasks could race on the same block commits and retry forever.

Catchupper now stores target_height as Arc<AtomicU64> shared with l2_sync_task, which reloads the target each loop iteration. New APIs: update_target_height (fetch_max, forward-only), is_sync_task_running, and target_height() for reads.

L1EventsProvider::start_catching_up sets CatchingUp first, then if a sync task is already in flight it only raises the shared target and returns; otherwise it resets the catchupper and starts sync as before. Aborting the old task was avoided because client calls are not cancel-safe.

A regression test asserts the same task handle and raised target when catch-up restarts with a higher block.

Reviewed by Cursor Bugbot for commit f8fa0b1. Bugbot is set up for automated code reviews on this repo. Configure here.

ShahakShama commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

@noamsp-starkware noamsp-starkware left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the commit message and description to reflect the actual changes introduced. something like 'reuse running sync task on catch-up restart by raising its shared target height'.

@noamsp-starkware reviewed 4 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on ShahakShama).


crates/apollo_l1_events/src/catchupper.rs line 109 at r1 (raw file):

    pub fn target_height(&self) -> BlockNumber {
        BlockNumber(self.target_height.load(Ordering::SeqCst))

Pair this read with Acquire to match the Release write below. SeqCst adds a global total-order guarantee across all SeqCst ops in the program; there's no other atomic here that needs to be sequenced against this one, so it's wasted.


crates/apollo_l1_events/src/catchupper.rs line 115 at r1 (raw file):

    /// Uses `fetch_max` so the target only moves forward; a lower height is ignored.
    pub fn update_target_height(&self, target_height: BlockNumber) {
        self.target_height.fetch_max(target_height.0, Ordering::SeqCst);

SeqCst is stronger than needed here. There's a single writer (the provider) and single reader (the sync task), so Release on the write side is sufficient — it guarantees all prior writes are visible before the new target value is published.

Code snippet:

self.target_height.fetch_max(target_height.0, Ordering::Release);

crates/apollo_l1_events/src/catchupper.rs line 173 at r1 (raw file):

    // (a higher block committed before catch-up finishes) extends this same task instead of
    // spawning a competing one.
    while current_height.0 <= target_height.load(Ordering::SeqCst) {

Acquire


crates/apollo_l1_events/src/catchupper.rs line 178 at r1 (raw file):

            "Syncing L1EventsProvider with L2 height: {} to target height: {}",
            current_height,
            target_height.load(Ordering::SeqCst)

Acquire

@ShahakShama ShahakShama changed the title apollo_l1_events: abort orphaned L2 sync task on catch-up restart apollo_l1_events: reuse running sync task on catch-up restart by raising its shared target height Jun 18, 2026

@ShahakShama ShahakShama left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShahakShama made 4 comments.
Reviewable status: 3 of 4 files reviewed, 4 unresolved discussions (waiting on noamsp-starkware).


crates/apollo_l1_events/src/catchupper.rs line 109 at r1 (raw file):

Previously, noamsp-starkware wrote…

Pair this read with Acquire to match the Release write below. SeqCst adds a global total-order guarantee across all SeqCst ops in the program; there's no other atomic here that needs to be sequenced against this one, so it's wasted.

Done.


crates/apollo_l1_events/src/catchupper.rs line 115 at r1 (raw file):

Previously, noamsp-starkware wrote…

SeqCst is stronger than needed here. There's a single writer (the provider) and single reader (the sync task), so Release on the write side is sufficient — it guarantees all prior writes are visible before the new target value is published.

Done.


crates/apollo_l1_events/src/catchupper.rs line 173 at r1 (raw file):

Previously, noamsp-starkware wrote…

Acquire

Done.


crates/apollo_l1_events/src/catchupper.rs line 178 at r1 (raw file):

Previously, noamsp-starkware wrote…

Acquire

Done.

@ShahakShama ShahakShama left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShahakShama resolved 4 discussions and dismissed @noamsp-starkware from 4 discussions.
Reviewable status: 3 of 4 files reviewed, all discussions resolved (waiting on noamsp-starkware).

@ShahakShama ShahakShama enabled auto-merge June 21, 2026 11:38
@asaf-sw asaf-sw self-requested a review June 21, 2026 12:19

@asaf-sw asaf-sw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix for the task leak — sharing an Arc<AtomicU64> target and reusing the in-flight task is the right call over abort+respawn, and the regression test pins the behavior well (ptr_eq on the handle, forever-failing sync). Confirmed the skipped reset_catchupper in the running branch is benign: the backlog is always drained via mem::take and the state set to Pending at catch-up completion, so it's empty whenever this branch is reached.

Two notes inline: one TOCTOU race worth flagging, one atomic-ordering nit.

Comment thread crates/apollo_l1_events/src/l1_events_provider.rs
Comment thread crates/apollo_l1_events/src/catchupper.rs

@asaf-sw asaf-sw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@asaf-sw reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ShahakShama).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit dff312e. Configure here.

Comment thread crates/apollo_l1_events/src/l1_events_provider.rs

@ShahakShama ShahakShama left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShahakShama resolved 1 discussion.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ShahakShama).

@asaf-sw asaf-sw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asaf-sw made 2 comments and resolved 2 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ShahakShama).

Comment thread crates/apollo_l1_events/src/catchupper.rs
Comment thread crates/apollo_l1_events/src/l1_events_provider.rs

@asaf-sw asaf-sw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ShahakShama).

@asaf-sw asaf-sw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@asaf-sw made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ShahakShama).

@ShahakShama ShahakShama disabled auto-merge June 21, 2026 13:28
@ShahakShama ShahakShama enabled auto-merge June 21, 2026 13:28
…ing its shared target height

Goal: Fix audit finding H-14 — unbounded task leak in the L1EventsProvider
catch-up flow.

Change summary:
- On catch-up restart, reuse the in-flight l2_sync_task instead of recreating the
  Catchupper and spawning a second one. target_height is now an Arc<AtomicU64>
  shared with the running task; start_catching_up raises it via
  update_target_height when a task is already running, and the task re-reads it
  each iteration so it keeps syncing up to the new target.
- Dropping a tokio JoinHandle does not cancel the task, so before this every
  catch-up restart leaked a still-running sync task that races to commit the same
  blocks and can spin forever in its retry loop (DoS).
- Add regression test restarting_catch_up_raises_target_without_spawning_second_sync_task.

Decision points:
- Keep the running task and raise its shared target rather than abort + respawn:
  aborting a task parked on a client call is not cancel-safe (the commit may apply
  while its result is never observed). A single long-lived task that tracks the
  latest target avoids both the leak and the cancel-safety hazard.
- Atomic orderings: Release on the writer (update_target_height), Acquire on the
  readers, since there is a single writer (provider) and single reader (sync task).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ShahakShama added a commit that referenced this pull request Jun 21, 2026
…e, relax atomic ordering

Goal: address asaf-sw's review notes on PR #14518 in a follow-up branch (the base commit is left untouched per request).

Changes:
- start_catching_up: after raising the target on a running sync task, re-check is_sync_task_running(); if the task finished in the TOCTOU window, fall through to respawn instead of leaving the provider stuck in CatchingUp with no sync task.
- target_height atomic: Acquire/Release -> Relaxed on both load and fetch_max sides, since the atomic carries only its own value and publishes no other memory.

Decision points:
- TOCTOU: chose respawn-on-finish (double-check) over comment-only. sync_task_health_check remains the backstop for the residual narrow window where a task finishes even after the second check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ShahakShama ShahakShama added this pull request to the merge queue Jun 21, 2026
Merged via the queue into main-v0.14.3 with commit 9f78ee7 Jun 21, 2026
19 checks passed
ShahakShama added a commit that referenced this pull request Jun 21, 2026
…e, relax atomic ordering

Goal: address asaf-sw's review notes on PR #14518 in a follow-up branch (the base commit is left untouched per request).

Changes:
- start_catching_up: after raising the target on a running sync task, re-check is_sync_task_running(); if the task finished in the TOCTOU window, fall through to respawn instead of leaving the provider stuck in CatchingUp with no sync task.
- target_height atomic: Acquire/Release -> Relaxed on both load and fetch_max sides, since the atomic carries only its own value and publishes no other memory.

Decision points:
- TOCTOU: chose respawn-on-finish (double-check) over comment-only. sync_task_health_check remains the backstop for the residual narrow window where a task finishes even after the second check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants