Runs are being marked lost by the janitor, then transitioning back to success up to an hour later when the worker finally delivers a run:complete event. This makes the lost state unreliable for KPIs, alerting, and capacity reasoning. The behaviour is a symptom of several interacting bugs on the worker side, all of which need to be investigated.
This issue collects the full picture and is the coordination point for fixing it correctly.
What actually happens to an affected run
Here's a concrete walk-through from an investigation of three affected runs (1de52840, 9f2875c9, b8ae37e7):
~03:00 — Worker sends step:complete. Lightning rejects it: {"step_id": ["This field can't be blank."]}. The step_id field is simply absent from the payload.
~03:00–04:05 — Worker enters a retry loop. Mixed error responses: run:log rejections ({"step_id": ["must be associated with the run"]}), plus timeouts. The worker keeps sending. None of these requests will ever succeed — the payload fails validation, and the server is reachable, so this is not a transient outage.
~04:05 — The janitor runs. The run hasn't had a valid completion event. It marks the run lost. KPIs and alerting fire.
~04:45 — The worker (which should have been cleaned up) sends run:complete, backdated to ~03:01 when the workflow actually finished. Lightning accepts it. The run transitions lost → success.
The run's token expired well before the ~04:45 run:complete. The channel connection persists because token validation only happens at channel join. The worker is still alive, still sending events, 45 minutes after the janitor gave up on it.
This is the current "recovery" path for these runs. It's accidental and it's masking real bugs.
Why this is a cluster of bugs, not one bug
Bug 1: Worker constructs step:complete without a step_id (OpenFn/kit#1072)
The immediate failure is a malformed payload: the step_id is nil or absent when the event is built. We don't know why — it hasn't been investigated. Candidates:
Race condition: the step is not fully initialised when the completion handler fires
Error path: the step failed to start but the completion event fires anyway
Double-send: is the worker emitting step:complete twice, once with a valid id and once without?
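Whatever the root cause turns out to be, the worker could fail fast at construction time. A minimal sketch of that guard — the names (StepCompletePayload, buildStepComplete) are illustrative, not kit's actual API:

```typescript
// Hypothetical payload shape; the real kit event carries more fields.
interface StepCompletePayload {
  step_id: string;
  reason: string;
}

// If step_id is missing, a step:complete can never pass Lightning's changeset
// validation, so surface the problem locally instead of sending a broken event.
function buildStepComplete(
  stepId: string | null | undefined,
  reason: string
): StepCompletePayload {
  if (!stepId) {
    throw new Error(
      'step:complete built without a step_id; treating step as failed'
    );
  }
  return { step_id: stepId, reason };
}
```

The point is that the error becomes visible at the moment it occurs, in the worker's own logs, rather than an hour later as a janitor-marked lost run.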
Joe's comment on OpenFn/kit#1072: "the moment step:complete was rejected, we know the run is dead." That's correct — but the worker doesn't act on that knowledge.
Bug 2: Worker retries deterministic failures (OpenFn/kit#1072, implicitly)
{"step_id": ["This field can't be blank."]} is a changeset validation error. Retrying with the same payload will never succeed. But the worker's retry logic doesn't distinguish between:
Transient failures (timeout, 503, network blip) → retry is appropriate
Validation failures (422-equivalent, malformed payload) → retry is futile
The worker treats both identically. This turns a single bad event into an hour-long flood of failed requests.
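A sketch of the missing classification. The reply shape ({ status, response }) is an assumption for illustration; the real distinction is whatever lets the worker tell a changeset rejection apart from a timeout:

```typescript
// Hypothetical reply shape from a channel push.
type Reply = { status: 'ok' | 'error' | 'timeout'; response?: unknown };

// A changeset-style rejection ({"field": ["message", ...]}) is deterministic:
// resending the same payload will fail the same way every time.
function isValidationError(reply: Reply): boolean {
  if (reply.status !== 'error') return false;
  const r = reply.response;
  return (
    typeof r === 'object' &&
    r !== null &&
    Object.values(r).every((v) => Array.isArray(v))
  );
}

function shouldRetry(reply: Reply): boolean {
  if (reply.status === 'ok') return false;
  if (reply.status === 'timeout') return true; // transient: retry is appropriate
  return !isValidationError(reply); // validation failure: terminal, surface it
}
```

With this in place, the ~03:00 step:complete rejection becomes a single terminal error instead of an hour of futile resends.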
OpenFn/kit#1130 (closed) addressed retrying after timeout. The retry-on-validation-error behaviour remains.
Bug 3: Worker event machinery outlives the run's timeout
The run's timeout should kill the worker process. The janitor marks runs lost several minutes after token expiry — there's grace — yet the worker is still alive 45 minutes after that. Either:
The retry/event-sending loop runs in a process not supervised by the run timeout
The timeout kills the execution subprocess but not the event machinery
Something else is keeping it alive
This is the least-understood part. It needs code path tracing in the kit repo.
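Until that tracing is done, here is one shape the fix could take — a single AbortController whose deadline covers everything the run owns, event machinery included. The structure is hypothetical, not kit's actual process model:

```typescript
// Sketch: one timer governs the run's whole lifetime, so the retry loop
// cannot outlive the run it belongs to.
function startRun(runTimeoutMs: number) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), runTimeoutMs);

  // Retry loop that checks the run's deadline on every iteration.
  async function sendWithRetry(send: () => Promise<void>, delayMs: number) {
    while (!controller.signal.aborted) {
      try {
        await send();
        return;
      } catch {
        // Back off, then re-check the deadline before retrying.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }

  return {
    signal: controller.signal, // pass to the execution subprocess too
    sendWithRetry,
    finish: () => clearTimeout(timer),
  };
}
```

If the event-sending loop in kit turns out to live in a separately supervised process, the same idea applies: it must observe the run's deadline, not just its own retry schedule.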
Bug 4: Lightning's update_run/1 was silently ignoring all changeset constraints (#4563)
Found during investigation. Runs.update_run/1 uses Repo.update_all internally. update_all discards changeset annotations entirely — meaning Run.validate_state_change/1 has been dead code in the handler pipeline. The state machine has always declared lost → success invalid, but that declaration was never executed.
PR #4563 switches to Repo.update, making the existing constraints actually run. Side effect: the lost → success transition that currently "recovers" these runs would be rejected instead. The PR is paused until the worker-side bugs are understood and fixed, so we don't strand runs that currently self-recover.
Bug 5: Janitor Repo.stream scoping (#2943, minor)
Janitor.find_and_update_lost/0 calls Repo.stream outside a transaction but consumes it inside one. Ecto requires streams to be opened within the transaction. This could cause the janitor to mark runs lost that completed between query execution and stream consumption. The {:error, :already_completed} guard at line 50–55 suggests this race was anticipated but the underlying code is still wrong. Low probability of real-world impact, but worth fixing as hygiene.
The issue landscape
Multiple issues across both repos track different facets of this. None are actively being worked on.
#2943 — Main tracking issue for runs going lost when step:complete fails. Original trigger: Postgres 15s timeout on large payloads. Joe's latest: "no dev in progress." Open.
#4563 — Enforce changeset constraints in the run/step handler pipeline. Draft, paused.
#4355 — lost is the wrong label: "we know exactly what happened, Lightning rejected the events." Proposes crashed/killed. Open.
What's missing
Every open issue proposes a Lightning-side mitigation: relabel the state, be more lenient, accept partial events. Nobody has opened an issue for:
Why does the worker construct step:complete without a step_id?
Why doesn't the worker distinguish retryable errors from validation failures?
Why does the worker process outlive the run's timeout?
These are the root causes. The Lightning mitigations are reasonable additions, but they don't fix the underlying behaviour.
Investigation tasks
Quantify prevalence: query for runs that have transitioned lost → success. How many? Over what time window? Is it always the step_id-blank pattern or are there other triggers?
For affected runs: what's the distribution of time between lost marking and the recovery run:complete? Are the same workflow/adaptor combinations over-represented?
Trace step:complete construction in kit: where does step_id get populated? What code paths could produce a nil/absent step_id? Is there a double-send path?
Trace retry logic in kit: where is the retry loop? Does it inspect the error type or HTTP status at all? Is there any classification of retryable vs non-retryable?
Trace worker process lifecycle: what process is responsible for sending events? Is it the same process subject to run timeout? What happens to it when the run times out?
Fix janitor stream scoping: move Repo.stream inside the transaction in Janitor.find_and_update_lost/0. Low priority but straightforward.
Proposed fix sequence
Step 1 — Investigate (kit repo)
Trace the three code paths above. Open focused issues (or sub-tasks here) for what's found. Don't propose fixes yet — understand the mechanics first.
Step 2 — Fix the worker
In rough priority order:
Stop sending step:complete without a step_id. If step_id is nil, treat the step as failed immediately rather than constructing a broken event.
Classify errors in the retry loop. Validation failures (payload rejected by the server) should not be retried. Surface them as terminal errors.
Ensure event-sending machinery is within the scope of the run's timeout — not just the execution subprocess.
kit#1218: when workflow-complete is rejected, attempt to send workflow-error as a fallback so the run reaches a known terminal state.
Step 3 — Merge Lightning PR #4563
Once runs aren't being incorrectly lost, enforcing the changeset constraints is safe to ship. The lost → success transition should be rejected because the state machine always said it was invalid — making that declaration execute is correct.
Step 4 — Lightning-side improvements (parallel, lower urgency)
#4355 (better state labels like crashed/killed) and improved leniency around large payloads (#2943's original trigger) are reasonable improvements that don't depend on the worker fixes. They can be worked in parallel, but shouldn't be taken on if they muddy the concerns above.
Interim options if #4563 needs to unblock before worker fixes land
Three options described in the PR itself:
Janitor yields to worker: allow final → final transitions when the source state is :lost — the janitor's marking doesn't prevent a later legitimate terminal event.
Graceful "too late" response: Lightning returns a structured non-error response when the run is already in a terminal state, so the worker doesn't spiral on retries.
Accept the new behaviour: lost stays lost, no recovery. KPIs get worse short-term but the state becomes meaningful. Acceptable only if worker fixes are imminent.
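For option 2, the worker side would need to recognise the structured response and stop. A sketch — the { status: 'too_late' } shape is entirely hypothetical; whatever Lightning actually returns, the point is that the worker treats it as terminal rather than retryable:

```typescript
// Hypothetical reply shapes for a run:complete push.
type CompleteReply =
  | { status: 'ok' }
  | { status: 'too_late' } // run already in a terminal state (e.g. marked lost)
  | { status: 'error'; response: unknown };

function handleCompleteReply(reply: CompleteReply): 'done' | 'retry' | 'give_up' {
  switch (reply.status) {
    case 'ok':
      return 'done';
    case 'too_late':
      // Lightning has already finalised the run; log locally and stop,
      // so the worker doesn't spiral on retries.
      return 'give_up';
    case 'error':
      return 'retry';
  }
}
```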