Conversation
…nd Trigger handlers Add instrumentation to diagnose whether Vercel function instances are being suspended between dispatch and background execution. Logs `dispatchDelayMs` to measure the gap between when work is queued via waitUntil and when the async handler actually starts executing. Warns when waitUntil is unavailable (fire-and-forget) and when delays exceed 5 seconds, indicating possible instance suspension. Co-authored-by: Cursor <cursoragent@cursor.com>
…solution The `getWaitUntil()` utility in agents-core dynamically imports `@vercel/functions`, but the package was only declared as a dependency in agents-api. With pnpm's strict dependency isolation, agents-core could not resolve it at runtime, causing `ERR_MODULE_NOT_FOUND` and making all waitUntil-based background work (Slack mentions, webhook triggers) silently fall back to untracked fire-and-forget execution. Co-authored-by: Cursor <cursoragent@cursor.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
There was a problem hiding this comment.
PR Review Summary
(5) Total Issues | Risk: Low
🟡 Minor (5) 🟡
🟡 1) multi-file Hardcoded timing thresholds without named constants
files:
agents-api/src/domains/run/services/TriggerService.ts:635packages/agents-work-apps/src/slack/services/events/app-mention.ts:82,93,141,307
Issue: The PR introduces magic numbers for timing thresholds — 5000ms for dispatch delay warnings (suspension detection) and 3000ms for slow operation warnings — without extracting them to named constants or documenting why these values were chosen.
Why: Magic numbers in alerting thresholds are a common source of noisy alerts or missed detections. Without documentation, future maintainers won't know if 5 seconds was chosen based on Vercel's serverless characteristics, observed P99 latencies, or another rationale. This makes tuning for different deployment contexts difficult.
Fix: Extract thresholds to named constants with brief documentation:
// In a shared constants file or at the top of each file:
/** Threshold for warning about possible serverless instance suspension.
* 5s chosen based on typical Vercel cold start being <2s; delays beyond
* this suggest the instance was suspended mid-execution. */
const DISPATCH_DELAY_WARNING_THRESHOLD_MS = 5000;
/** Threshold for warning about slow I/O operations. */
const SLOW_OPERATION_THRESHOLD_MS = 3000;Refs:
🟡 2) package.json Version specifier differs from agents-api
Issue: The new @vercel/functions dependency uses ^1.6.0 but agents-api/package.json specifies ^1.4.0 for the same package.
Why: While pnpm's lockfile ensures consistent resolution, having different version specifiers across packages can cause confusion during maintenance and version bumps.
Fix: Align the version specifier with agents-api (using ^1.4.0) or update both to use the same version floor.
Refs:
Inline Comments:
- 🟡 Minor:
knip.config.ts:8Stale knip config comment about @vercel/functions - 🟡 Minor:
app-mention.ts:75Span attribute uses inconsistent naming convention - 🟡 Minor:
@vercel__functions/index.d.ts:1-6Type declaration comment is now stale
💭 Consider (2) 💭
💭 1) events.ts Missing waitUntil warning for non-app_mention handlers
Issue: The PR adds a warning log when waitUntil is unavailable for app_mention events, but other background work handlers in the same file (modal interactions, shortcuts, etc.) don't have the same warning pattern.
Why: Creates inconsistent diagnostic coverage — some fire-and-forget scenarios will be logged while others silently proceed untracked.
Fix: Either apply the same warning pattern to all waitUntil usages for consistency, or document that app_mention is intentionally the only instrumented path (perhaps it's the most critical).
💭 2) events.ts:230 Log timing differs from TriggerService pattern
Issue: The new success log fires after calling waitUntil(), but TriggerService.ts logs before calling it. Either approach is valid, but this creates a subtle inconsistency.
Fix: Consider aligning with the TriggerService pattern (log before call) for consistency, or keep as-is if the "registered" phrasing intentionally indicates post-registration.
💡 APPROVE WITH SUGGESTIONS
Summary: This is a well-structured observability enhancement that adds valuable diagnostic instrumentation for detecting Vercel serverless instance suspension. The core implementation is sound — tracking dispatch-to-execution delays and logging slow operations will help debug production issues. The suggestions above are minor cleanup items (stale comments, magic numbers, naming conventions) that would improve maintainability but don't block the functional value of these changes. Nice work adding this debugging capability! 🎉
Discarded (7)
| Location | Issue | Reason Discarded |
|---|---|---|
package.json |
@vercel/functions as prod dependency increases bundle | Invalid — this is the fix; the previous dynamic-import-only pattern caused module resolution failures in pnpm's strict isolation |
app-mention.ts |
3000ms thresholds may cause noise on cold starts | Speculative — the thresholds seem reasonable and this is the intended diagnostic purpose |
pnpm-lock.yaml |
Lockfile changes | Proportional to the dependency addition, no issues |
package.json |
License concern | Apache-2.0 is compatible, no action needed |
.changeset/ |
Missing changeset | Changeset already exists covering this change |
| Multi-file | Timing field naming pattern (dispatchDelayMs vs durationMs) |
Informational — the new descriptive naming is appropriate for diagnostic logs with multiple timings |
| Multi-file | Consistent warning message pattern | Positive observation — no action needed |
Reviewers (4)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-sre |
5 | 1 | 0 | 0 | 1 | 0 | 3 |
pr-review-devops |
7 | 1 | 0 | 0 | 2 | 0 | 4 |
pr-review-consistency |
5 | 0 | 2 | 0 | 0 | 0 | 3 |
pr-review-standards |
0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 17 | 2 | 2 | 0 | 3 | 0 | 10 |
| span.setAttribute(SLACK_SPAN_KEYS.IS_IN_THREAD, Boolean(threadTs && threadTs !== messageTs)); | ||
| if (threadTs) span.setAttribute(SLACK_SPAN_KEYS.THREAD_TS, threadTs); | ||
| if (messageTs) span.setAttribute(SLACK_SPAN_KEYS.MESSAGE_TS, messageTs); | ||
| if (dispatchDelayMs !== undefined) span.setAttribute('dispatch_delay_ms', dispatchDelayMs); |
There was a problem hiding this comment.
🟡 Minor: Span attribute uses inconsistent naming convention
Issue: The new span attribute 'dispatch_delay_ms' uses snake_case as a string literal, but the established pattern in this file uses the SLACK_SPAN_KEYS constants with dot-notation (e.g., SLACK_SPAN_KEYS.TEAM_ID → 'slack.team_id').
Why: Inconsistent span attribute naming makes observability governance harder — queries and dashboards may miss this attribute if they expect the slack.* namespace. Using constants also prevents typos and enables IDE autocomplete.
Fix: Add the attribute to SLACK_SPAN_KEYS in tracer.ts and use the constant:
// In tracer.ts, add to SLACK_SPAN_KEYS:
DISPATCH_DELAY_MS: 'slack.dispatch_delay_ms',
// Then here:
if (dispatchDelayMs !== undefined) span.setAttribute(SLACK_SPAN_KEYS.DISPATCH_DELAY_MS, dispatchDelayMs);Refs:
- packages/agents-work-apps/src/slack/tracer.ts:21-41 — existing SLACK_SPAN_KEYS pattern
|
not relevant anymore |
No description provided.