Run lifecycle integrity: lost→success ghost transitions and worker event bugs #4565

@stuartc

Description

Runs are being marked lost by the janitor, then transitioning back to success up to an hour later when the worker finally delivers a run:complete event. This makes the lost state unreliable for KPIs, alerting, and capacity reasoning. The behaviour is a symptom of several interacting bugs on the worker side, all of which need to be investigated.

This issue collects the full picture and is the coordination point for fixing it correctly.


What actually happens to an affected run

Here's a concrete walk-through from an investigation of three affected runs (1de52840, 9f2875c9, b8ae37e7):

  1. ~03:00 — Worker sends step:complete. Lightning rejects it: {"step_id": ["This field can't be blank."]}. The step_id field is simply absent from the payload.

  2. ~03:00–04:05 — Worker enters a retry loop. Mixed error responses: run:log rejections ({"step_id": ["must be associated with the run"]}), plus timeouts. The worker keeps sending. None of these requests will ever succeed: the payload fails validation, and the server is still available, so the intermittent timeouts are transient noise, not the underlying cause.

  3. ~04:05 — The janitor runs. The run hasn't had a valid completion event. It marks the run lost. KPIs and alerting fire.

  4. ~04:45 — The worker (which should have been cleaned up) sends run:complete, backdated to ~03:01 when the workflow actually finished. Lightning accepts it. The run transitions lost → success.

The run's token expired well before step 4. The channel connection persists because token validation only happens at channel join. The worker is still alive, still sending events, 45 minutes after the janitor gave up on it.

This is the current "recovery" path for these runs. It's accidental and it's masking real bugs.


Why this is a cluster of bugs, not one bug

Bug 1: Worker constructs step:complete without a step_id (OpenFn/kit#1072)

The immediate failure is a malformed payload. The step_id is nil or absent when the event is built. We don't know why — it hasn't been investigated. Candidates:

  • Race condition: step not fully initialised when the completion handler fires
  • Error path: step failed to start but the completion event fires anyway
  • Double-send: OpenFn/kit#1072 notes the worker sometimes receives both OK and ERR for the same step:complete — is it sending twice, once with a valid id and once without?

Joe's comment on OpenFn/kit#1072: "the moment step:complete was rejected, we know the run is dead." That's correct — but the worker doesn't act on that knowledge.
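A minimal sketch of what "acting on that knowledge" could look like: guard event construction so a missing step_id fails fast instead of producing a payload Lightning will reject. The types and function names here are illustrative assumptions, not kit's actual code.

```typescript
// Hypothetical guard for step:complete construction. If step_id is absent,
// surface a terminal error immediately rather than emitting a broken event
// that the server will reject and the retry loop will hammer forever.

interface StepCompletePayload {
  step_id: string;
}

type BuildResult =
  | { ok: true; payload: StepCompletePayload }
  | { ok: false; reason: string };

function buildStepComplete(stepId: string | null | undefined): BuildResult {
  if (!stepId) {
    // Per Joe's observation: a rejected/unbuildable step:complete means the
    // run is dead. Report that, don't send a payload that can't validate.
    return { ok: false, reason: "step_id missing at completion time" };
  }
  return { ok: true, payload: { step_id: stepId } };
}
```

The point is that the failure is detectable before the event leaves the worker; the network round-trip and retry loop never need to happen.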

Bug 2: Worker retries deterministic failures (OpenFn/kit#1072, implicitly)

{"step_id": ["This field can't be blank."]} is a changeset validation error. Retrying with the same payload will never succeed. But the worker's retry logic doesn't distinguish between:

  • Transient failures (timeout, 503, network blip) → retry is appropriate
  • Validation failures (422-equivalent, malformed payload) → retry is futile

The worker treats both identically. This turns a single bad event into an hour-long flood of failed requests.

OpenFn/kit#1130 (closed) addressed retrying after timeout. The retry-on-validation-error behaviour remains.

Bug 3: Worker event machinery outlives the run's timeout

The run's timeout should kill the worker process. The janitor marks runs lost several minutes after token expiry — there's grace — yet the worker is still alive 45 minutes after that. Either:

  • The retry/event-sending loop runs in a process not supervised by the run timeout
  • The timeout kills the execution subprocess but not the event machinery
  • Something else is keeping it alive

This is the least-understood part. It needs code path tracing in the kit repo.
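One plausible shape for the fix, once the code path is understood: put the event-sending machinery under the same cancellation scope as the run's timeout. This sketch uses Node's AbortController; the function names are assumptions, not kit's actual structure.

```typescript
// Hypothetical run-scoped cancellation: the retry loop checks the run's
// AbortSignal, so a timed-out run cannot keep sending events afterwards.

function runScope(timeoutMs: number) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return { signal: controller.signal, cancel: () => clearTimeout(timer) };
}

async function sendWithRetry(
  send: () => Promise<boolean>,
  signal: AbortSignal,
  delayMs = 100,
): Promise<boolean> {
  while (!signal.aborted) {
    if (await send()) return true;
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return false; // run timed out: stop sending, let the janitor's verdict stand
}
```

Whether the actual fix looks like this depends on what the lifecycle trace finds — the key property is that no event-sending process survives the run's timeout.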

Bug 4: Lightning's update_run/1 was silently ignoring all changeset constraints (#4563)

Found during investigation. Runs.update_run/1 uses Repo.update_all internally. update_all discards changeset annotations entirely — meaning Run.validate_state_change/1 has been dead code in the handler pipeline. The state machine has always declared lost → success invalid, but that declaration was never executed.

PR #4563 switches to Repo.update, making the existing constraints actually run. Side effect: the lost → success transition that currently "recovers" these runs would be rejected instead. The PR is paused until the worker-side bugs are understood and fixed, so we don't strand runs that currently self-recover.

Bug 5: Janitor Repo.stream scoping (#2943, minor)

Janitor.find_and_update_lost/0 builds a Repo.stream outside a transaction but consumes it inside one. Ecto requires the stream to be enumerated within the transaction that owns it. This could cause the janitor to mark runs lost that completed between query execution and stream consumption. The {:error, :already_completed} guard at lines 50–55 suggests this race was anticipated, but the underlying code is still wrong. Low probability of real-world impact, but worth fixing as hygiene.


The issue landscape

Multiple issues across both repos track different facets of this. None are actively being worked on.

| Issue | What it's about |
| --- | --- |
| OpenFn/kit#1072 | step:complete sent without step_id. Open. |
| OpenFn/kit#1218 | When workflow-complete is rejected, the worker should send workflow-error as a fallback. Open. |
| OpenFn/kit#1130 | Worker kept retrying after timeout. Closed (partially — changeset retries not addressed). |
| #4355 | lost is the wrong label — "we know exactly what happened, Lightning rejected the events." Proposes crashed/killed. Open. |
| #2943 | Main tracking issue for runs going lost when step:complete fails. Original trigger: Postgres 15s timeout on large payloads. Joe's latest: "no dev in progress." Open. |
| #3565 | Runs marked lost when they should be re-enqueued. Heap memory implicated. Open. |
| #4563 | Enforce changeset constraints in the run/step handler pipeline. Draft, paused. |

What's missing

Every open issue proposes a Lightning-side mitigation: relabel the state, be more lenient, accept partial events. Nobody has opened an issue for:

  • Why does the worker construct step:complete without a step_id?
  • Why doesn't the worker distinguish retryable errors from validation failures?
  • Why does the worker process outlive the run's timeout?

These are the root causes. The Lightning mitigations are reasonable additions, but they don't fix the underlying behaviour.


Investigation tasks

  • Quantify prevalence: query for runs that have transitioned lost → success. How many? Over what time window? Is it always the step_id-blank pattern or are there other triggers?
  • For affected runs: what's the distribution of time between lost marking and the recovery run:complete? Are the same workflow/adaptor combinations over-represented?
  • Trace step:complete construction in kit: where does step_id get populated? What code paths could produce a nil/absent step_id? Is there a double-send path?
  • Trace retry logic in kit: where is the retry loop? Does it inspect the error type or HTTP status at all? Is there any classification of retryable vs non-retryable?
  • Trace worker process lifecycle: what process is responsible for sending events? Is it the same process subject to run timeout? What happens to it when the run times out?
  • Fix janitor stream scoping: move Repo.stream inside the transaction in Janitor.find_and_update_lost/0. Low priority but straightforward.

Proposed fix sequence

Step 1 — Investigate (kit repo)

Trace the three code paths above. Open focused issues (or sub-tasks here) for what's found. Don't propose fixes yet — understand the mechanics first.

Step 2 — Fix the worker

In rough priority order:

  1. Stop sending step:complete without a step_id. If step_id is nil, treat the step as failed immediately rather than constructing a broken event.
  2. Classify errors in the retry loop. Validation failures (payload rejected by the server) should not be retried. Surface them as terminal errors.
  3. Ensure event-sending machinery is within the scope of the run's timeout — not just the execution subprocess.
  4. Implement OpenFn/kit#1218: when workflow-complete is rejected, attempt to send workflow-error as a fallback so the run reaches a known terminal state.
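Item 4 can be sketched as a downgrade path. The event names follow the issue; the send API and payload shapes are assumptions:

```typescript
// Hypothetical kit#1218 fallback: if workflow-complete is rejected, report
// workflow-error instead, so the run ends in a known terminal state rather
// than going lost.

type Send = (event: string, payload: object) => Promise<boolean>;

async function finishRun(send: Send, runId: string, result: object): Promise<string> {
  if (await send("workflow-complete", { run_id: runId, result })) {
    return "complete";
  }
  // Completion was rejected: don't retry a payload the server has already
  // refused; downgrade to a terminal error report.
  if (await send("workflow-error", { run_id: runId, reason: "completion rejected" })) {
    return "errored";
  }
  return "unreported"; // both failed; the janitor will mark the run lost
}
```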

Step 3 — Merge Lightning PR #4563

Once runs aren't being incorrectly lost, enforcing the changeset constraints is safe to ship. The lost → success transition should be rejected because the state machine always said it was invalid — making that declaration execute is correct.

Step 4 — Lightning-side improvements (parallel, lower urgency)

#4355 (better state labels like crashed/killed) and improved leniency around large payloads (#2943's original trigger) are reasonable improvements that don't depend on the worker fixes. They can be worked in parallel, but shouldn't be taken on if they muddy the previous concerns.


Interim options if #4563 needs to unblock before worker fixes land

Three options described in the PR itself:

  1. Janitor yields to worker: allow final → final transitions when the source state is :lost — the janitor's marking doesn't prevent a later legitimate terminal event.
  2. Graceful "too late" response: Lightning returns a structured non-error response when the run is already in a terminal state, so the worker doesn't spiral on retries.
  3. Accept the new behaviour: lost stays lost, no recovery. KPIs get worse short-term but the state becomes meaningful. Acceptable only if worker fixes are imminent.
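From the worker's side, option 2 implies a small protocol change: a structured "too late" reply that the retry loop treats as terminal. The reply shape below is an assumption, not Lightning's actual protocol:

```typescript
// Hypothetical handling of a structured "too late" reply (interim option 2):
// a run already in a terminal state server-side means stop, not retry.

type Reply =
  | { status: "ok" }
  | { status: "too_late"; run_state: string }
  | { status: "error"; retryable: boolean };

function nextAction(reply: Reply): "done" | "stop" | "retry" {
  switch (reply.status) {
    case "ok":
      return "done";
    case "too_late":
      return "stop"; // run already terminal server-side: retrying is futile
    case "error":
      return reply.retryable ? "retry" : "stop";
  }
}
```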

Metadata

Labels: lost runs (tracking issues related to lost runs)
Project status: Tech Backlog