AdaptiveTimeoutPolicy drops a block of stages from a healthy endpoint after one slow query #513

## Summary

`AdaptiveTimeoutPolicy` (`packages/pipeline/src/sparql/timeoutPolicy.ts`) can drop a large block of analysis results from a **healthy** endpoint. Once `tightenAfterTimeouts` consecutive timeouts occur (2 in the DKG config, so a single slow query plus its retry is enough), the policy clamps the endpoint to the tightened budget (10s as configured by the DKG). That budget is far below the legitimate cost of the heavy VoID queries, so every subsequent heavy query times out as collateral, and the endpoint stays clamped until some unrelated *fast* query happens to land. The net effect is that one expensive query poisons a dozen following stages even though the endpoint is responding normally.

## Observed (real run against a large endpoint)

DKG config: `adaptiveTimeoutPolicy({ defaultMs: 300_000, tightenedMs: 10_000, tightenAfterTimeouts: 2 })`.

```
✔ subjects.rq            1m 35s
✔ properties.rq            11.9s
✔ object-literals.rq     2m 8s
✔ object-uris.rq         2m 6s
✖ datatypes.rq           aborted        (exceeded the 5m budget, retried, timed out again)
   ↘ Tightened timeout to 10s after 2 consecutive timeouts
✖ triples.rq             aborted        (needs ~9s normally — but now budget is 10s and the endpoint is busy)
✖ class-partition.rq     aborted
✖ class-properties-*     aborted   (×6)
✖ licenses.rq            aborted
✖ entity-properties.rq   aborted
✖ subject-uri-space.rq   aborted
✖ object-uri-space.rq    aborted
✔ shacl-sample-CreativeWork  1.4s         ← fast query, fits in 10s
   ↗ Relaxed timeout back to 5m after successful request
✖ media.rq               aborted
✔ iiif.rq                3m 59s
```

10 of ~20 analysis stages were lost. Only `datatypes.rq` actually hit a real server-side limit; the rest are collateral — they each legitimately take seconds-to-minutes but were given a 10s budget. The single fast SHACL-sample query is the only thing that relaxed the endpoint, after which `iiif.rq` (4 min) succeeded again — proving the endpoint was healthy the whole time.

## Why it misfires

1. **The tightened budget conflates two different conditions.** “Fast-fail a dead endpoint” and “this particular query is expensive” are not the same. A 10s tightened budget is below the normal latency of the heavy VoID queries, so once tightened, *healthy* heavy queries cannot pass — they are guaranteed to time out.
2. **Tightened state is self-sustaining.** In `afterRequest`, a `timeout` outcome at the tightened budget keeps incrementing `consecutiveTimeouts` and keeps `tightened = true`. The only escape is an `ok`, but the very queries being clamped need *more* than the tightened budget to return `ok` — so the endpoint can only recover if some coincidentally-cheap query runs. A run of heavy stages back-to-back never recovers on its own.
3. **No re-probe / decay.** The policy never periodically retries at the default budget to check whether the endpoint is actually still slow, so it stays pessimistic indefinitely.
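
The three points above can be reproduced with a small model. The sketch below is not the code in `timeoutPolicy.ts`; it is a simplified reconstruction that assumes the policy behaves as described in this issue. Only the option names and the `afterRequest`, `consecutiveTimeouts` and `tightened` identifiers come from the issue; `ModelPolicy`, `budgetMs`, `simulate` and the query costs are hypothetical. The simulated endpoint is always healthy: a query succeeds whenever its cost fits the current budget.

```typescript
// Simplified, hypothetical model of the behavior described in this issue.
// NOT the real AdaptiveTimeoutPolicy implementation.
type Outcome = "ok" | "timeout";

interface PolicyOptions {
  defaultMs: number;
  tightenedMs: number;
  tightenAfterTimeouts: number;
}

class ModelPolicy {
  private readonly opts: PolicyOptions;
  private consecutiveTimeouts = 0;
  private tightened = false;

  constructor(opts: PolicyOptions) {
    this.opts = opts;
  }

  // Budget handed to the next request.
  budgetMs(): number {
    return this.tightened ? this.opts.tightenedMs : this.opts.defaultMs;
  }

  afterRequest(outcome: Outcome): void {
    if (outcome === "ok") {
      // The only escape from the tightened state.
      this.consecutiveTimeouts = 0;
      this.tightened = false;
      return;
    }
    // A timeout at the tightened budget keeps the state tightened.
    this.consecutiveTimeouts += 1;
    if (this.consecutiveTimeouts >= this.opts.tightenAfterTimeouts) {
      this.tightened = true;
    }
  }
}

// Healthy endpoint: a query succeeds if and only if its cost fits the budget.
function simulate(policy: ModelPolicy, costsMs: number[]): Outcome[] {
  return costsMs.map((costMs) => {
    const outcome: Outcome = costMs <= policy.budgetMs() ? "ok" : "timeout";
    policy.afterRequest(outcome);
    return outcome;
  });
}

const policy = new ModelPolicy({
  defaultMs: 300_000,
  tightenedMs: 10_000,
  tightenAfterTimeouts: 2,
});

// Illustrative costs: one query over 5m (attempt plus retry), two heavy but
// healthy stages, one cheap query, then another heavy stage.
const outcomes = simulate(policy, [400_000, 400_000, 120_000, 90_000, 1_400, 239_000]);
console.log(outcomes); // timeout, timeout, timeout, timeout, ok, ok
```

In this model the third and fourth queries would have passed at `defaultMs`. They fail only because of the clamp, and recovery has to wait for the 1.4s query.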

## Possible directions (not prescriptive)

- Make `tightenedMs` less aggressive, or derive it from observed successful durations on that endpoint (e.g. a multiple of the slowest recent `ok`), rather than a flat 10s that is below legitimate query cost.
- Periodically re-probe at `defaultMs` (decay the tightened state after N requests or T seconds) so a healthy-but-slow endpoint recovers without needing a lucky fast query.
- Distinguish “the endpoint is unreachable / failing immediately” (connection errors, instant failures — good signal to fast-fail) from “this query is just slow” (a single long-running request that is otherwise progressing).
- Consider tagging known-heavy stages so a timeout on an expensive query doesn’t tighten the budget for cheaper ones.
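
Two of the directions above (a tightened budget derived from observed successes, and a periodic re-probe at `defaultMs`) could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed patch: `ReprobingPolicy`, `reprobeAfterRequests`, `okMultiplier`, `budgetMs` and `run` are hypothetical names, the policy is assumed to be called for one request at a time (`budgetMs` then `afterRequest`), and the slowest `ok` is tracked over the whole run rather than a recent window.

```typescript
// Hypothetical sketch, not the existing AdaptiveTimeoutPolicy.
type Outcome = "ok" | "timeout";

interface ReprobingOptions {
  defaultMs: number;
  tightenedMs: number; // used here as a floor for the tightened budget
  tightenAfterTimeouts: number;
  reprobeAfterRequests: number; // assumed: tightened requests before one probe
  okMultiplier: number; // assumed: tightened budget >= multiplier * slowest ok
}

class ReprobingPolicy {
  private readonly opts: ReprobingOptions;
  private consecutiveTimeouts = 0;
  private tightened = false;
  private tightenedRequests = 0;
  private slowestOkMs = 0;

  constructor(opts: ReprobingOptions) {
    this.opts = opts;
  }

  private probing(): boolean {
    return this.tightened && this.tightenedRequests >= this.opts.reprobeAfterRequests;
  }

  budgetMs(): number {
    // Relaxed, or due for a probe: use the full default budget.
    if (!this.tightened || this.probing()) return this.opts.defaultMs;
    // Otherwise tighten, but never below a multiple of the slowest success.
    const derived = this.slowestOkMs * this.opts.okMultiplier;
    return Math.min(this.opts.defaultMs, Math.max(this.opts.tightenedMs, derived));
  }

  afterRequest(outcome: Outcome, durationMs: number): void {
    if (outcome === "ok") {
      this.slowestOkMs = Math.max(this.slowestOkMs, durationMs);
      this.consecutiveTimeouts = 0;
      this.tightened = false;
      this.tightenedRequests = 0;
      return;
    }
    this.consecutiveTimeouts += 1;
    if (this.tightened) {
      // A failed probe restarts the countdown; otherwise count toward the next probe.
      this.tightenedRequests = this.probing() ? 0 : this.tightenedRequests + 1;
    } else if (this.consecutiveTimeouts >= this.opts.tightenAfterTimeouts) {
      this.tightened = true;
      this.tightenedRequests = 0;
    }
  }
}

// Healthy endpoint: a query succeeds if and only if its cost fits the budget.
function run(policy: ReprobingPolicy, costsMs: number[]): Outcome[] {
  return costsMs.map((costMs) => {
    const budget = policy.budgetMs();
    const outcome: Outcome = costMs <= budget ? "ok" : "timeout";
    policy.afterRequest(outcome, Math.min(costMs, budget));
    return outcome;
  });
}

const reprobing = new ReprobingPolicy({
  defaultMs: 300_000,
  tightenedMs: 10_000,
  tightenAfterTimeouts: 2,
  reprobeAfterRequests: 2,
  okMultiplier: 2,
});

// Heavy stages only, no cheap query: the probe on the fifth request recovers.
const result = run(reprobing, [400_000, 400_000, 120_000, 90_000, 60_000, 239_000]);
console.log(result); // timeout, timeout, timeout, timeout, ok, ok
```

With these illustrative settings a run of heavy stages still loses two requests to the clamp before the probe fires, so `reprobeAfterRequests` trades fast-failing a dead endpoint against collateral on a healthy one. Once a slow success has been observed, the derived floor keeps the tightened budget above that query's cost.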

## Impact

Datasets on large but healthy endpoints get a substantially incomplete VoID summary (missing triple counts, class partition, class/property breakdowns) whenever one heavy query triggers the tightening. This was masked until an upstream proxy timeout was fixed; with that fix in place, the adaptive policy is now the dominant cause of incomplete results for large datasets.

