Skip to content

test(ci): widen timing windows for flaky SessionStore + socket.io tests#7647

Open
JohnMcLear wants to merge 2 commits intodevelopfrom
fix/ci-flake-session-socketio-timing
Open

test(ci): widen timing windows for flaky SessionStore + socket.io tests#7647
JohnMcLear wants to merge 2 commits intodevelopfrom
fix/ci-flake-session-socketio-timing

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

Summary

Two flaky test groups dominate develop CI failures (~30% of recent runs over the past week). Both are timing fragility, not logic bugs.

  • tests/backend/specs/SessionStore.tsset(sess) schedules a timeout based on the cookie's expires. Tests use 100ms expiries + 110ms waits. On slow runners (Windows / loaded Linux), the wall-clock between set() and the assertion can exceed 100ms; _updateExpirations then sees exp.real <= now and calls _destroy(), deleting the DB record. The next assertion reads null / undefined and fails. Tripled affected windows (100→300, 110→330, 200→600) — semantics preserved, ~+3s headroom.
  • tests/backend/common.ts:waitForSocketEvent — 1s timeout. Socket.io handshake + auth + CLIENT_READY round-trip can exceed 1s on a slow runner. Bumped to 5s.

This addresses the specific test cases consistently appearing in failed runs:

  • SessionStore: get of record from previous run (not yet expired), external expiration update is picked up, shutdown cancels timeouts
  • socketio: !authn anonymous cookie /p/pad -> 200, ok, authn user /p/pad -> 200, ok
  • clientvar_rev_consistency: CLIENT_VARS stays consistent under concurrent edits during handshake

The latest develop commit happened to be green, but ~25–40% of recent develop runs failed on at least one of these. This PR removes that flake.

Test plan

  • mocha tests/backend/specs/SessionStore.ts — 28 pass locally (5s)
  • mocha tests/backend/specs/socketio.ts — 33 pass locally (2s)
  • CI Backend tests green across all 8 OS×Node matrix combos
  • CI "Upgrade from latest release" green

🤖 Generated with Claude Code

SessionStore.ts and socketio.ts dominate develop CI failures (~25–40% of
recent runs). Both fail due to race conditions on slow Windows / loaded
Linux runners — not logic bugs.

SessionStore tests configure session expiry windows of 100ms and then
sleep 110ms before asserting. On a slow runner, the wall-clock between
`set()` and the assertion can exceed 100ms, the timeout in
`SessionStore._updateExpirations()` then sees `exp.real <= now` and
calls `_destroy()`, deleting the DB record before the assertion runs.
The test then reads `null` / `undefined` instead of the expected JSON.

Tripling each affected window (100→300, 110→330, 200→600) keeps the
relative timing semantics identical while leaving enough headroom for
a slow CI runner. Local run is +3s on this spec; cheap insurance for
the global flake rate.

`waitForSocketEvent` in tests/backend/common.ts uses a 1s timeout for
socket.io message round-trips. The socket.io handshake + auth +
CLIENT_READY can exceed 1s on a slow runner; bumped to 5s.

The most-failing tests this addresses:
  - SessionStore: get of record from previous run (not yet expired)
  - SessionStore: external expiration update is picked up
  - SessionStore: shutdown cancels timeouts
  - socketio: !authn anonymous cookie /p/pad -> 200, ok
  - socketio: authn user /p/pad -> 200, ok
  - clientvar_rev_consistency: CLIENT_VARS stays consistent under
    concurrent edits during handshake

All 28 SessionStore + 33 socketio tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Widen timing windows for flaky SessionStore and socket.io tests

🧪 Tests

Grey Divider

Walkthroughs

Description
• Increase SessionStore test timing windows from 100/110ms to 300/330ms
• Extend socket.io event timeout from 1s to 5s
• Eliminate race conditions on slow CI runners
• Preserve test semantics while adding headroom
Diagram
flowchart LR
  A["Slow CI Runners"] -->|"Race Conditions"| B["Flaky Tests"]
  B -->|"SessionStore: 100→300ms expiry"| C["Wider Timing Windows"]
  B -->|"Wait: 110→330ms"| C
  B -->|"Socket.io: 1s→5s timeout"| C
  C -->|"Eliminates Flake"| D["Stable CI Runs"]
Loading

Grey Divider

File Changes

1. src/tests/backend/common.ts 🧪 Tests +1/-1

Extend socket.io event wait timeout

• Increased waitForSocketEvent timeout from 1000ms to 5000ms
• Accommodates slower socket.io handshake + auth + CLIENT_READY round-trips
• Prevents timeout failures on loaded CI runners

src/tests/backend/common.ts


2. src/tests/backend/specs/SessionStore.ts 🧪 Tests +16/-16

Triple SessionStore timing windows for stability

• Tripled session expiry windows from 100ms to 300ms across 10 test cases
• Tripled wait periods from 110ms to 330ms to match expiry changes
• Updated one external expiration window from 200ms to 600ms
• Maintains relative timing semantics while preventing race conditions

src/tests/backend/specs/SessionStore.ts


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented May 1, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Remediation recommended

1. Hardcoded 5s socket wait🐞 Bug ◔ Observability
Description
waitForSocketEvent now hardcodes a 5s timeout for all socket events, which can cause suites that
rely on Mocha’s default per-test timeout to fail with a generic Mocha timeout instead of the
helper’s descriptive error. It also makes failing runs wait longer before surfacing the root cause.
Code

src/tests/backend/common.ts[R126-132]

     const timeout = setTimeout(() => {
       reject(new Error(`timed out waiting for ${event} event`));
       cancelTimeout = () => {};
-      }, 1000);
+      }, 5000);
     cancelTimeout = () => {
       clearTimeout(timeout);
       resolve();
Evidence
The helper’s timeout was increased to 5000ms, but at least one suite that uses connect()/handshake()
(which depends on waitForSocketEvent) does not set a larger Mocha timeout, so the test framework can
time out first and obscure the intended failure message.

src/tests/backend/common.ts[114-158]
src/tests/backend/specs/messages.ts[12-36]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`waitForSocketEvent()` uses a hardcoded 5000ms timeout for *all* socket waits. In suites that don’t increase Mocha’s default per-test timeout, failures may surface as a generic Mocha timeout instead of `waitForSocketEvent`’s explicit `timed out waiting for <event>` error.
### Issue Context
Some callers (e.g., connect/handshake paths) legitimately need a longer timeout on slow CI runners, but other call sites benefit from failing fast and producing a clear error.
### Fix Focus Areas
- Add an optional `timeoutMs` parameter to `waitForSocketEvent(socket, event, timeoutMs?)`, and use it in `setTimeout(..., timeoutMs)`.
- Update slow paths (`connect()`, `handshake()`, and any other known-slow call sites) to pass `5000` explicitly.
- Keep a shorter default (or ensure suites that rely on defaults set `this.timeout(...)` high enough) to avoid Mocha masking the helper’s error.
- src/tests/backend/common.ts[114-219]
- src/tests/backend/specs/messages.ts[12-60]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. SessionStore waits still tight🐞 Bug ☼ Reliability
Description
SessionStore expiry tests still rely on fixed sleeps with only ~30ms headroom (e.g., expires in
300ms, then sleep 330ms), so timer jitter or event-loop delays can still cause intermittent failures
under heavy CI load. This PR improves the margin vs. before, but it doesn’t eliminate the underlying
flake pattern.
Code

src/tests/backend/specs/SessionStore.ts[R59-66]

   it('set of session that expires', async function () {
-      const sess:any  = {foo: 'bar', cookie: {expires: new Date(Date.now() + 100)}};
+      const sess:any  = {foo: 'bar', cookie: {expires: new Date(Date.now() + 300)}};
     await set(sess);
     assert.equal(JSON.stringify(await db.get(`sessionstorage:${sid}`)), JSON.stringify(sess));
-      await new Promise((resolve) => setTimeout(resolve, 110));
+      await new Promise((resolve) => setTimeout(resolve, 330));
     // Writing should start a timeout.
     assert(await db.get(`sessionstorage:${sid}`) == null);
   });
Evidence
Session expiration cleanup is scheduled via setTimeout(exp.real - now) and the implementation
explicitly calls out slow-system timing races. With only 30ms cushion, the test can still assert
before the cleanup has actually executed.

src/tests/backend/specs/SessionStore.ts[59-66]
src/node/db/SessionStore.ts[111-140]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Several SessionStore tests use fixed `setTimeout` sleeps and then assert the DB record is gone/present. Even with the increased windows, the assertions can still race the actual cleanup work if the event loop is delayed.
### Issue Context
`SessionStore` schedules expiration cleanup with `setTimeout(...)` and documents races on slow systems. Tests that assume cleanup has run at an exact time remain inherently timing-fragile.
### Fix Focus Areas
- Replace fixed sleeps like `await new Promise(r => setTimeout(r, 330))` + assert with a small polling helper (e.g., poll every 25ms up to a 2–5s max) that waits until the DB condition is met.
- Keep the expiry durations modest, but remove tight coupling between “sleep duration” and “expiry duration”.
- src/tests/backend/specs/SessionStore.ts[47-168]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

… for cleanup

Both items raised in Qodo's review of #7647.

1) Hardcoded 5s socket wait
   waitForSocketEvent() now takes an optional timeoutMs (default 1000ms,
   matching pre-PR behaviour). Only the known-slow connect() and
   handshake() paths pass 5000ms — they're the ones blowing the 1s budget
   on loaded CI runners. Per-message waits (waitForAcceptCommit and
   ad-hoc callers in messages.ts etc.) keep the 1s default so failures
   surface fast with the descriptive helper error rather than the generic
   Mocha timeout.

2) SessionStore waits still tight
   Replaced fixed sleeps with a small `eventually()` poller for the three
   "record should be gone after expiry" assertions:
     - set of session that expires
     - switch from non-expiring to expiring
     - get of record from previous run (not yet expired)
   These now poll every 25ms up to 2000ms so they pass immediately when
   cleanup completes on a fast runner, and tolerate hundreds of ms of
   event-loop delay on a slow one. No fixed coupling between sleep
   duration and expiry duration.

   For the inverse "record should still be there" case in
   `shutdown cancels timeouts`, polling doesn't apply (we're verifying
   that a cancelled timer did NOT fire, which requires a real wait past
   the original expiry). Bumped expires 300→500ms and wait 330→700ms so
   setup (set+get+shutdown) has 500ms before the timer would fire (vs.
   30ms previously) and the 700ms wait still passes the original expiry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear
Copy link
Copy Markdown
Member Author

Pushed f339991 addressing both Qodo points:

1. Hardcoded 5s socket waitwaitForSocketEvent(socket, event, timeoutMs?) now accepts an optional timeout, default 1000ms (matches pre-PR behaviour). Only the known-slow connect() and handshake() call sites pass 5000ms — they're the ones that were timing out on loaded CI. Per-message waits in messages.ts, clientvar_rev_consistency, etc. keep the 1s default so failures surface fast with the descriptive helper error rather than getting masked by Mocha's per-test timeout.

2. SessionStore waits still tight — for the three "record should be gone after expiry" assertions (set of session that expires, switch from non-expiring to expiring, get of record from previous run (not yet expired)) replaced the fixed sleep-then-assert with a small eventually() poller (25ms / 2000ms max). These now pass immediately when cleanup completes on a fast runner and tolerate hundreds of ms of event-loop delay on a slow one. No fixed coupling between sleep duration and expiry duration.

For the inverse case shutdown cancels timeouts polling doesn't apply (we're verifying that a cancelled timer did not fire — requires waiting past the original expiry, can't poll for "still nothing happened"). Bumped expires 300→500ms and wait 330→700ms so setup (set+get+shutdown) has a 500ms cushion (vs 30ms previously) before the timer would have fired, and the 700ms wait is still past the original expiry.

All 28 SessionStore + 33 socketio tests pass locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant