Summary
AdaptiveTimeoutPolicy (packages/pipeline/src/sparql/timeoutPolicy.ts) can drop a large block of analysis results from a healthy endpoint. Once a couple of genuinely-slow-but-fine queries time out, the policy clamps the endpoint to the tightened budget (10s as configured by the DKG), which is far below the legitimate cost of the heavy VoID queries — so every subsequent heavy query times out as collateral, and the endpoint stays clamped until some unrelated fast query happens to land. The net effect is that one expensive query poisons a dozen following stages even though the endpoint is responding normally.
Observed (real run against a large endpoint)
DKG config: adaptiveTimeoutPolicy({ defaultMs: 300_000, tightenedMs: 10_000, tightenAfterTimeouts: 2 }).
✔ subjects.rq 1m 35s
✔ properties.rq 11.9s
✔ object-literals.rq 2m 8s
✔ object-uris.rq 2m 6s
✖ datatypes.rq aborted (exceeded the 5m budget, retried, timed out again)
↘ Tightened timeout to 10s after 2 consecutive timeouts
✖ triples.rq aborted (needs ~9s normally — but now budget is 10s and the endpoint is busy)
✖ class-partition.rq aborted
✖ class-properties-* aborted (×6)
✖ licenses.rq aborted
✖ entity-properties.rq aborted
✖ subject-uri-space.rq aborted
✖ object-uri-space.rq aborted
✔ shacl-sample-CreativeWork 1.4s ← fast query, fits in 10s
↗ Relaxed timeout back to 5m after successful request
✖ media.rq aborted
✔ iiif.rq 3m 59s
10 of ~20 analysis stages were lost. Only datatypes.rq actually hit a real server-side limit; the rest are collateral — they each legitimately take seconds-to-minutes but were given a 10s budget. The single fast SHACL-sample query is the only thing that relaxed the endpoint, after which iiif.rq (4 min) succeeded again — proving the endpoint was healthy the whole time.
Why it misfires
- The tightened budget conflates two different conditions. “Fast-fail a dead endpoint” and “this particular query is expensive” are not the same. A 10s tightened budget is below the normal latency of the heavy VoID queries, so once tightened, healthy heavy queries cannot pass — they are guaranteed to time out.
- Tightened state is self-sustaining. In afterRequest, a timeout outcome at the tightened budget keeps incrementing consecutiveTimeouts and keeps tightened = true. The only escape is an ok, but the very queries being clamped need more than the tightened budget to return ok, so the endpoint can only recover if some coincidentally cheap query runs. A run of heavy stages back-to-back never recovers on its own.
- No re-probe / decay. The policy never periodically retries at the default budget to check whether the endpoint is actually still slow, so it stays pessimistic indefinitely.
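The feedback loop described in the bullets above can be reproduced with a small model. This is a minimal sketch, not the code in timeoutPolicy.ts: the class ModelPolicy, the simulate helper and the query costs are all hypothetical. The only assumption taken from this report is the behaviour itself (tighten after N consecutive timeouts, relax only on ok).

```typescript
// Simplified model of the behaviour described in this issue.
// NOT the real AdaptiveTimeoutPolicy; all names here are illustrative.
type Outcome = "ok" | "timeout";

interface PolicyOptions {
  defaultMs: number;
  tightenedMs: number;
  tightenAfterTimeouts: number;
}

class ModelPolicy {
  private readonly opts: PolicyOptions;
  private consecutiveTimeouts = 0;
  private tightened = false;

  constructor(opts: PolicyOptions) {
    this.opts = opts;
  }

  // Budget handed to the next request.
  budgetMs(): number {
    return this.tightened ? this.opts.tightenedMs : this.opts.defaultMs;
  }

  // Assumed state update: only an ok outcome can clear the tightened state.
  afterRequest(outcome: Outcome): void {
    if (outcome === "ok") {
      this.consecutiveTimeouts = 0;
      this.tightened = false;
      return;
    }
    this.consecutiveTimeouts += 1;
    if (this.consecutiveTimeouts >= this.opts.tightenAfterTimeouts) {
      this.tightened = true;
    }
  }
}

// A query "times out" when its (hypothetical) cost exceeds the current budget.
function simulate(policy: ModelPolicy, costsMs: number[]): Outcome[] {
  return costsMs.map((cost) => {
    const outcome: Outcome = cost <= policy.budgetMs() ? "ok" : "timeout";
    policy.afterRequest(outcome);
    return outcome;
  });
}

const policy = new ModelPolicy({
  defaultMs: 300_000,
  tightenedMs: 10_000,
  tightenAfterTimeouts: 2,
});

// One over-budget query plus its retry, two healthy heavy queries,
// one cheap query, then another heavy query.
const outcomes = simulate(policy, [400_000, 400_000, 60_000, 120_000, 1_400, 240_000]);
console.log(outcomes);
```

In this model the 60s and 120s queries fail only because they run while the budget is clamped to 10s. The 240s query succeeds because the cheap 1.4s query happened to run first and relaxed the budget.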
Possible directions (not prescriptive)
- Make tightenedMs less aggressive, or derive it from observed successful durations on that endpoint (e.g. a multiple of the slowest recent ok), rather than a flat 10s that is below legitimate query cost.
- Periodically re-probe at defaultMs (decay the tightened state after N requests or T seconds) so a healthy-but-slow endpoint recovers without needing a lucky fast query.
- Distinguish “the endpoint is unreachable / failing immediately” (connection errors, instant failures — good signal to fast-fail) from “this query is just slow” (a single long-running request that is otherwise progressing).
- Consider tagging known-heavy stages so a timeout on an expensive query doesn’t tighten the budget for cheaper ones.
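To make the first two directions concrete, here is a rough sketch of a derived tightened budget and a request-count based re-probe. It is one possible shape, not a proposed patch: ReprobingPolicy, reprobeAfterRequests, tightenedBudgetFrom and the query costs are hypothetical names and values that do not exist in the current code.

```typescript
// Illustrative only: hypothetical names, not the existing API.
type Outcome = "ok" | "timeout";

// Direction 1: derive the tightened budget from recent successful durations.
function tightenedBudgetFrom(okDurationsMs: number[], factor: number, floorMs: number): number {
  if (okDurationsMs.length === 0) return floorMs;
  return Math.max(floorMs, factor * Math.max(...okDurationsMs));
}

interface ReprobeOptions {
  defaultMs: number;
  tightenedMs: number;
  tightenAfterTimeouts: number;
  reprobeAfterRequests: number; // hypothetical: tightened requests before one probe at defaultMs
}

// Direction 2: while tightened, periodically grant one request the default budget.
class ReprobingPolicy {
  private readonly opts: ReprobeOptions;
  private consecutiveTimeouts = 0;
  private tightened = false;
  private requestsWhileTightened = 0;

  constructor(opts: ReprobeOptions) {
    this.opts = opts;
  }

  private isProbe(): boolean {
    return this.tightened && this.requestsWhileTightened >= this.opts.reprobeAfterRequests;
  }

  budgetMs(): number {
    if (!this.tightened || this.isProbe()) return this.opts.defaultMs;
    return this.opts.tightenedMs;
  }

  afterRequest(outcome: Outcome): void {
    if (outcome === "ok") {
      // Any success, including a successful probe, relaxes the policy.
      this.consecutiveTimeouts = 0;
      this.tightened = false;
      this.requestsWhileTightened = 0;
      return;
    }
    if (this.tightened) {
      // A failed probe restarts the countdown; otherwise count towards the next probe.
      this.requestsWhileTightened = this.isProbe() ? 0 : this.requestsWhileTightened + 1;
      return;
    }
    this.consecutiveTimeouts += 1;
    if (this.consecutiveTimeouts >= this.opts.tightenAfterTimeouts) {
      this.tightened = true;
      this.requestsWhileTightened = 0;
    }
  }
}

const reprobing = new ReprobingPolicy({
  defaultMs: 300_000,
  tightenedMs: 10_000,
  tightenAfterTimeouts: 2,
  reprobeAfterRequests: 2,
});

// Hypothetical costs: an over-budget query and its retry, then only heavy queries.
const trace = [400_000, 400_000, 60_000, 120_000, 90_000, 240_000].map((cost) => {
  const budgetMs = reprobing.budgetMs();
  const outcome: Outcome = cost <= budgetMs ? "ok" : "timeout";
  reprobing.afterRequest(outcome);
  return { budgetMs, outcome };
});
console.log(trace);

// Twice the slowest recent ok, with 10s as a floor.
const derivedMs = tightenedBudgetFrom([95_000, 11_900, 128_000], 2, 10_000);
console.log(derivedMs);
```

With these made-up costs, two heavy queries are still lost while the budget is tightened. The third runs as a probe at defaultMs, succeeds, and relaxes the policy without any cheap query in the sequence. A time-based decay (T seconds) would follow the same pattern, with a timestamp in place of the request counter.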
Impact
Datasets on large but healthy endpoints get a substantially incomplete VoID summary (missing triple counts, class partition, class/property breakdowns) whenever one heavy query trips the tighten. This was masked until an upstream proxy timeout was fixed; with that gone, the adaptive policy is now the dominant cause of incomplete results for large datasets.