Large datasets OOM-kill the whole pipeline run (QLever query memory unbounded by memory-max-size; isolate QLever or cap per-dataset memory)

## Summary

A single oversized dataset can make QLever consume ~13 GiB+ of query memory during analysis stages, which **OOM-kills the whole pipeline pod** and aborts the entire run — all remaining datasets go unprocessed. Because the pipeline runs Node and native QLever in the **same container/cgroup**, QLever’s OOM takes the Node process down with it. Neither `memory-max-size` nor `cache-max-size` bounds the peak.

This was found while running the [dataset-knowledge-graph](https://github.com/netwerk-digitaal-erfgoed/dataset-knowledge-graph) pipeline (which builds on `@lde/pipeline` + `@lde/sparql-qlever`) on a 16 GiB Kubernetes pod, but the underlying behaviour is in LDE.

## Repro / evidence

Dataset: an openarchieven linkset, **~219.9 M triples**. Run locally in the dkg image (Node + native `qlever-server` `a14e0a0` in one container), memory-capped, processing only that dataset.

- **Peak is ~13 GiB of anonymous (query) memory**, reached transiently during heavy analysis stages and released between them. The peak stage **varies across runs**: `datatypes.rq`, `class-properties-subjects.rq`, `object-uris.rq`, `shacl-sample-*`. So it is the dataset’s **scale**, not one pathological query.
- **`memory-max-size` does not bound the peak:** measured anon peak ≈ 13.0–13.3 GiB at caps of `8G`, `10G`, and `16G`. The cap only changes the *failure mode*: set **below** the container limit it turns hard kills into graceful `HTTP 500` aborts (caught per stage, run continues, summary partial); set **at/above** the container limit the container is OOM-killed.
- **`cache-max-size` does not bound the peak:** anon peak ≈ 13.0 GiB at both `cache-max-size=1G` and the default `30G`.
- **`OOMKilled=true`** confirmed via the container’s cgroup state.
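
As a rough illustration of checking the cgroup state, the sketch below parses the text of a cgroup v2 `memory.events` file and reports whether the kernel recorded an OOM kill. It assumes cgroup v2 and that the file content has already been read (typically from `/sys/fs/cgroup/memory.events` inside the container). The function names are hypothetical and not part of LDE, and this is not necessarily how the observation above was made.

```typescript
// Parse the text of a cgroup v2 `memory.events` file into a key -> count map.
// Each line has the form "<event> <count>", e.g. "oom_kill 1".
function parseMemoryEvents(text: string): Record<string, number> {
  const events: Record<string, number> = {};
  for (const line of text.split("\n")) {
    const [key, value] = line.trim().split(/\s+/);
    // Skip blank or malformed lines instead of failing.
    if (key && value !== undefined && /^\d+$/.test(value)) {
      events[key] = Number(value);
    }
  }
  return events;
}

// True if at least one process in the cgroup was killed by the OOM killer.
function wasOomKilled(memoryEventsText: string): boolean {
  return (parseMemoryEvents(memoryEventsText).oom_kill ?? 0) > 0;
}
```

On Docker, the same information is also exposed as `State.OOMKilled` in `docker inspect`.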

## Why it kills the whole run

- The dkg image is `FROM ${QLEVER_IMAGE}` and runs `qlever-server` **native, in the same container** as the Node pipeline (`NativeTaskRunner`). One cgroup, one memory limit.
- On Kubernetes the container OOM (with `memory.oom.group=1`) kills **all** processes in the container cgroup atomically → Node dies too → `backoffLimit` reached → run aborts.
- Locally (Docker, `oom.group=0`) the OOM killed only `qlever-server`; Node survived, caught the subsequent `fetch failed` errors per stage, and the run **completed** (exit 0) with a partial summary for that dataset. That difference (blast radius) is the crux.
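
The local behaviour relies on per-stage error handling: a failed stage is recorded and the run moves on. A minimal sketch of that pattern follows; `StageResult` and `runStages` are hypothetical names for illustration and do not reflect the actual `@lde/pipeline` API. It only helps while the Node process itself survives, which is exactly what a group OOM kill prevents.

```typescript
// Outcome of one analysis stage: either a value or a recorded error message.
type StageResult =
  | { stage: string; ok: true; value: unknown }
  | { stage: string; ok: false; error: string };

// Run stages sequentially; a failing stage (for example a `fetch failed`
// error after the SPARQL server died) is recorded instead of aborting the run.
async function runStages(
  stages: Record<string, () => Promise<unknown>>,
): Promise<StageResult[]> {
  const results: StageResult[] = [];
  for (const [stage, run] of Object.entries(stages)) {
    try {
      results.push({ stage, ok: true, value: await run() });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ stage, ok: false, error });
    }
  }
  return results;
}
```

The returned list is the partial summary: successful stages keep their values and failed ones carry the error text.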

## Impact

One dataset that exceeds the memory budget aborts the entire pipeline run, losing all datasets scheduled after it. A memory-constrained, Burstable pod (request < limit) is also a prime eviction/OOM target under node pressure, so the failure point is non-deterministic.

## Proposed directions

1. **Isolate QLever from the pipeline process** so a QLever OOM cannot kill the run. One option: run `qlever-server` as a **separate container/cgroup** (sidecar) that the pipeline talks to over the network, with a shared volume for index files and a small supervisor that (re)loads the per-dataset index. This requires a new `@lde/sparql-qlever` topology (today the importer and server share one `TaskRunner` in one process). Net effect: an over-budget dataset degrades to a partial summary instead of killing the run.
2. **Bound per-dataset memory** for large datasets — e.g. cap the triple count / sample, or stream/batch the heavy analysis queries so peak memory stays within a configurable budget. This is the path to actually *completing* huge datasets within a fixed limit.
3. **Document the `memory-max-size` failure-mode behaviour** — setting it below the container limit converts OOM kills into catchable `HTTP 500` aborts (resilience), even though it does not lower the peak. Useful as an interim mitigation.
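
For the interim mitigation in item 3, a startup check could warn when `memory-max-size` is not below the container limit. The sketch below is illustrative only: the function names and return labels are hypothetical, it handles only the `<number>G` form used in this issue, and it assumes `G` means GiB (if QLever interprets it as a smaller unit, the check errs on the safe side). The container limit is passed in as bytes; on cgroup v2 it could be read from `memory.max`.

```typescript
const GIB = 1024 ** 3;

// Parse sizes such as "8G" or "1.5G" into bytes, treating G as GiB (assumption).
function parseGibibytes(size: string): number {
  const match = /^(\d+(?:\.\d+)?)G$/.exec(size.trim());
  if (!match) {
    throw new Error(`Unsupported size format: ${size}`);
  }
  return Number(match[1]) * GIB;
}

// Report which failure mode this issue observed for an over-budget dataset:
// a cap below the container limit gave catchable HTTP 500 aborts, while a cap
// at or above the limit ended in a container OOM kill.
function expectedFailureMode(
  memoryMaxSize: string,
  containerLimitBytes: number,
): "graceful-abort" | "oom-kill-risk" {
  return parseGibibytes(memoryMaxSize) < containerLimitBytes
    ? "graceful-abort"
    : "oom-kill-risk";
}
```

For a 16 GiB container, `expectedFailureMode("8G", 16 * GIB)` falls in the graceful category and `expectedFailureMode("16G", 16 * GIB)` in the risky one. The check says nothing about peak memory, which the cap does not lower.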

## Environment

- `@lde/sparql-qlever` with QLever `a14e0a0`, native mode.
- `@lde/pipeline` analysis stages (`@lde/pipeline-void`, `@lde/pipeline-shacl-sampler`).
- 16 GiB container (Kubernetes), Node peak RSS ~0.3 GiB (the memory is QLever’s).


Large datasets OOM-kill the whole pipeline run (QLever query memory unbounded by memory-max-size; isolate QLever or cap per-dataset memory) #427
