Run lifecycle integrity: lost→success ghost transitions and worker event bugs #4565

@stuartc

Description

Runs are being marked lost by the janitor, then transitioning back to success up to an hour later when the worker finally delivers a run:complete event. This makes the lost state unreliable for KPIs, alerting, and capacity reasoning. The behaviour is a symptom of several interacting bugs on the worker side, all of which need to be investigated.

This issue collects the full picture and is the coordination point for fixing it correctly.


What actually happens to an affected run

Here's a concrete walk-through from an investigation of three affected runs (1de52840, 9f2875c9, b8ae37e7):

  1. ~03:00 — Worker sends step:complete. Lightning rejects it: {"step_id": ["This field can't be blank."]}. The step_id field is simply absent from the payload.

  2. ~03:00–04:05 — Worker enters a retry loop. Mixed error responses: run:log rejections ({"step_id": ["must be associated with the run"]}), plus timeouts. The worker keeps sending. None of these requests will ever succeed: the payload fails validation, and the server is still available, so the intermittent timeouts are transient noise, not the underlying cause.

  3. ~04:05 — The janitor runs. The run hasn't had a valid completion event. It marks the run lost. KPIs and alerting fire.

  4. ~04:45 — The worker (which should have been cleaned up) sends run:complete, backdated to ~03:01 when the workflow actually finished. Lightning accepts it. The run transitions lost → success.

The run's token expired well before step 4. The channel connection persists because token validation only happens at channel join. The worker is still alive, still sending events, 45 minutes after the janitor gave up on it.

This is the current "recovery" path for these runs. It's accidental and it's masking real bugs.


Why this is a cluster of bugs, not one bug

Bug 1: Worker constructs step:complete without a step_id (OpenFn/kit#1072)

The immediate failure is a malformed payload. The step_id is nil or absent when the event is built. We don't know why — it hasn't been investigated. Candidates:

  • Race condition: step not fully initialised when the completion handler fires
  • Error path: step failed to start but the completion event fires anyway
  • Double-send: OpenFn/kit#1072 notes the worker sometimes receives both OK and ERR for the same step:complete — is it sending twice, once with a valid id and once without?

Joe's comment on OpenFn/kit#1072: "the moment step:complete was rejected, we know the run is dead." That's correct — but the worker doesn't act on that knowledge.
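A minimal sketch of what "acting on that knowledge" could look like: guard event construction so a missing step_id fails fast instead of producing a payload Lightning will reject. The types and function names here are illustrative assumptions, not kit's actual code.

```typescript
// Hypothetical guard for step:complete construction. If step_id is absent,
// surface a terminal error immediately rather than emitting a broken event
// that the server will reject and the retry loop will hammer forever.

interface StepCompletePayload {
  step_id: string;
}

type BuildResult =
  | { ok: true; payload: StepCompletePayload }
  | { ok: false; reason: string };

function buildStepComplete(stepId: string | null | undefined): BuildResult {
  if (!stepId) {
    // Per Joe's observation: a rejected/unbuildable step:complete means the
    // run is dead. Report that, don't send a payload that can't validate.
    return { ok: false, reason: "step_id missing at completion time" };
  }
  return { ok: true, payload: { step_id: stepId } };
}
```

The point is that the failure is detectable before the event leaves the worker; the network round-trip and retry loop never need to happen.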

Bug 2: Worker retries deterministic failures (OpenFn/kit#1072, implicitly)

{"step_id": ["This field can't be blank."]} is a changeset validation error. Retrying with the same payload will never succeed. But the worker's retry logic doesn't distinguish between:

  • Transient failures (timeout, 503, network blip) → retry is appropriate
  • Validation failures (422-equivalent, malformed payload) → retry is futile

The worker treats both identically. This turns a single bad event into an hour-long flood of failed requests.

OpenFn/kit#1130 (closed) addressed retrying after timeout. The retry-on-validation-error behaviour remains.

Bug 3: Worker event machinery outlives the run's timeout

The run's timeout should kill the worker process. The janitor marks runs lost several minutes after token expiry — there's grace — yet the worker is still alive 45 minutes after that. Either:

  • The retry/event-sending loop runs in a process not supervised by the run timeout
  • The timeout kills the execution subprocess but not the event machinery
  • Something else is keeping it alive

This is the least-understood part. It needs code path tracing in the kit repo.
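One plausible shape for the fix, once the code path is understood: put the event-sending machinery under the same cancellation scope as the run's timeout. This sketch uses Node's AbortController; the function names are assumptions, not kit's actual structure.

```typescript
// Hypothetical run-scoped cancellation: the retry loop checks the run's
// AbortSignal, so a timed-out run cannot keep sending events afterwards.

function runScope(timeoutMs: number) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return { signal: controller.signal, cancel: () => clearTimeout(timer) };
}

async function sendWithRetry(
  send: () => Promise<boolean>,
  signal: AbortSignal,
  delayMs = 100,
): Promise<boolean> {
  while (!signal.aborted) {
    if (await send()) return true;
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return false; // run timed out: stop sending, let the janitor's verdict stand
}
```

Whether the actual fix looks like this depends on what the lifecycle trace finds — the key property is that no event-sending process survives the run's timeout.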

Bug 4: Lightning's update_run/1 was silently ignoring all changeset constraints (#4563)

Found during investigation. Runs.update_run/1 uses Repo.update_all internally. update_all discards changeset annotations entirely — meaning Run.validate_state_change/1 has been dead code in the handler pipeline. The state machine has always declared lost → success invalid, but that declaration was never executed.

PR #4563 switches to Repo.update, making the existing constraints actually run. Side effect: the lost → success transition that currently "recovers" these runs would be rejected instead. The PR is paused until the worker-side bugs are understood and fixed, so we don't strand runs that currently self-recover.

Bug 5: Janitor Repo.stream scoping (#2943, minor)

Janitor.find_and_update_lost/0 builds a Repo.stream outside a transaction but consumes it inside one. Ecto requires the stream to be enumerated within the transaction that owns it. This could cause the janitor to mark runs lost that completed between query execution and stream consumption. The {:error, :already_completed} guard at lines 50–55 suggests this race was anticipated, but the underlying code is still wrong. Low probability of real-world impact, but worth fixing as hygiene.


The issue landscape

Multiple issues across both repos track different facets of this. None are actively being worked on.

| Issue | What it's about |
| --- | --- |
| OpenFn/kit#1072 | step:complete sent without step_id. Open. |
| OpenFn/kit#1218 | When workflow-complete is rejected, the worker should send workflow-error as a fallback. Open. |
| OpenFn/kit#1130 | Worker kept retrying after timeout. Closed (partially — changeset retries not addressed). |
| #4355 | lost is the wrong label — "we know exactly what happened, Lightning rejected the events." Proposes crashed/killed. Open. |
| #2943 | Main tracking issue for runs going lost when step:complete fails. Original trigger: Postgres 15s timeout on large payloads. Joe's latest: "no dev in progress." Open. |
| #3565 | Runs marked lost when they should be re-enqueued. Heap memory implicated. Open. |
| #4563 | Enforce changeset constraints in the run/step handler pipeline. Draft, paused. |

What's missing

Every open issue proposes a Lightning-side mitigation: relabel the state, be more lenient, accept partial events. Nobody has opened an issue for:

  • Why does the worker construct step:complete without a step_id?
  • Why doesn't the worker distinguish retryable errors from validation failures?
  • Why does the worker process outlive the run's timeout?

These are the root causes. The Lightning mitigations are reasonable additions, but they don't fix the underlying behaviour.


Investigation tasks

  • Quantify prevalence: query for runs that have transitioned lost → success. How many? Over what time window? Is it always the step_id-blank pattern or are there other triggers?
  • For affected runs: what's the distribution of time between lost marking and the recovery run:complete? Are the same workflow/adaptor combinations over-represented?
  • Trace step:complete construction in kit: where does step_id get populated? What code paths could produce a nil/absent step_id? Is there a double-send path?
  • Trace retry logic in kit: where is the retry loop? Does it inspect the error type or HTTP status at all? Is there any classification of retryable vs non-retryable?
  • Trace worker process lifecycle: what process is responsible for sending events? Is it the same process subject to run timeout? What happens to it when the run times out?
  • Fix janitor stream scoping: move Repo.stream inside the transaction in Janitor.find_and_update_lost/0. Low priority but straightforward.

Proposed fix sequence

Step 1 — Investigate (kit repo)

Trace the three code paths above. Open focused issues (or sub-tasks here) for what's found. Don't propose fixes yet — understand the mechanics first.

Step 2 — Fix the worker

In rough priority order:

  1. Stop sending step:complete without a step_id. If step_id is nil, treat the step as failed immediately rather than constructing a broken event.
  2. Classify errors in the retry loop. Validation failures (payload rejected by the server) should not be retried. Surface them as terminal errors.
  3. Ensure event-sending machinery is within the scope of the run's timeout — not just the execution subprocess.
  4. Implement OpenFn/kit#1218: when workflow-complete is rejected, attempt to send workflow-error as a fallback so the run reaches a known terminal state.
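Item 4 can be sketched as a downgrade path. The event names follow the issue; the send API and payload shapes are assumptions:

```typescript
// Hypothetical kit#1218 fallback: if workflow-complete is rejected, report
// workflow-error instead, so the run ends in a known terminal state rather
// than going lost.

type Send = (event: string, payload: object) => Promise<boolean>;

async function finishRun(send: Send, runId: string, result: object): Promise<string> {
  if (await send("workflow-complete", { run_id: runId, result })) {
    return "complete";
  }
  // Completion was rejected: don't retry a payload the server has already
  // refused; downgrade to a terminal error report.
  if (await send("workflow-error", { run_id: runId, reason: "completion rejected" })) {
    return "errored";
  }
  return "unreported"; // both failed; the janitor will mark the run lost
}
```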

Step 3 — Merge Lightning PR #4563

Once runs aren't being incorrectly lost, enforcing the changeset constraints is safe to ship. The lost → success transition should be rejected because the state machine always said it was invalid — making that declaration execute is correct.

Step 4 — Lightning-side improvements (parallel, lower urgency)

#4355 (better state labels like crashed/killed) and improved leniency around large payloads (#2943's original trigger) are reasonable improvements that don't depend on the worker fixes. They can be worked in parallel, but shouldn't be taken on if they muddy the previous concerns.


Interim options if #4563 needs to unblock before worker fixes land

Three options described in the PR itself:

  1. Janitor yields to worker: allow final → final transitions when the source state is :lost — the janitor's marking doesn't prevent a later legitimate terminal event.
  2. Graceful "too late" response: Lightning returns a structured non-error response when the run is already in a terminal state, so the worker doesn't spiral on retries.
  3. Accept the new behaviour: lost stays lost, no recovery. KPIs get worse short-term but the state becomes meaningful. Acceptable only if worker fixes are imminent.
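From the worker's side, option 2 implies a small protocol change: a structured "too late" reply that the retry loop treats as terminal. The reply shape below is an assumption, not Lightning's actual protocol:

```typescript
// Hypothetical handling of a structured "too late" reply (interim option 2):
// a run already in a terminal state server-side means stop, not retry.

type Reply =
  | { status: "ok" }
  | { status: "too_late"; run_state: string }
  | { status: "error"; retryable: boolean };

function nextAction(reply: Reply): "done" | "stop" | "retry" {
  switch (reply.status) {
    case "ok":
      return "done";
    case "too_late":
      return "stop"; // run already terminal server-side: retrying is futile
    case "error":
      return reply.retryable ? "retry" : "stop";
  }
}
```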

Metadata

Labels: lost runs (tracking issues related to lost runs)
Project status: Tech Backlog