Large datasets OOM-kill the whole pipeline run (QLever query memory unbounded by memory-max-size; isolate QLever or cap per-dataset memory) #427

@ddeboer

Summary

A single oversized dataset can make QLever consume ~13 GiB+ of query memory during analysis stages, which OOM-kills the whole pipeline pod and aborts the entire run — all remaining datasets go unprocessed. Because the pipeline runs Node and native QLever in the same container/cgroup, QLever’s OOM takes the Node process down with it. Neither memory-max-size nor cache-max-size bounds the peak.

This was found while running the dataset-knowledge-graph pipeline (which builds on @lde/pipeline + @lde/sparql-qlever) on a 16 GiB Kubernetes pod, but the underlying behaviour is in LDE.

Repro / evidence

Dataset: an openarchieven linkset, ~219.9 M triples. Run locally in the dkg image (Node + native qlever-server a14e0a0 in one container), memory-capped, processing only that dataset.

  • Peak is ~13 GiB of anonymous (query) memory, reached transiently during heavy analysis stages and released between them. The peak stage varies across runs: datatypes.rq, class-properties-subjects.rq, object-uris.rq, shacl-sample-*. So it is the dataset’s scale, not one pathological query.
  • memory-max-size does not bound the peak: the measured anon peak was ≈ 13.0–13.3 GiB at caps of 8G, 10G, and 16G. The cap only changes the failure mode. Set below the container limit, it turns hard kills into graceful HTTP 500 aborts (caught per stage; the run continues with a partial summary). Set at or above the container limit, the container is OOM-killed.
  • cache-max-size does not bound the peak: anon peak ≈ 13.0 GiB at both cache-max-size=1G and the default 30G.
  • OOMKilled=true confirmed via the container’s cgroup state.
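One way to check for an OOM kill from inside a container is to read the cgroup's event counters. The sketch below assumes cgroup v2, where the memory.events file holds one "name count" pair per line; the helper names parseMemoryEvents and wasOomKilled are hypothetical, and this is not necessarily how the result above was obtained.

```typescript
// Parse the text of a cgroup v2 `memory.events` file into a counter map.
// Each line has the form "<name> <count>", for example "oom_kill 1".
function parseMemoryEvents(text: string): Map<string, number> {
  const counters = new Map<string, number>();
  for (const line of text.split("\n")) {
    const [name, value] = line.trim().split(/\s+/);
    // Skip blank or malformed lines instead of failing.
    if (name && value !== undefined && /^\d+$/.test(value)) {
      counters.set(name, Number(value));
    }
  }
  return counters;
}

// True if the kernel has OOM-killed at least one process in this cgroup.
function wasOomKilled(text: string): boolean {
  return (parseMemoryEvents(text).get("oom_kill") ?? 0) > 0;
}

// Illustrative sample content, not a measurement from this issue.
const sample = "low 0\nhigh 0\nmax 412\noom 1\noom_kill 1\noom_group_kill 0\n";
console.log(wasOomKilled(sample)); // true
```

In a real container the text would typically come from /sys/fs/cgroup/memory.events; that path is an assumption about how the cgroup filesystem is mounted.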

Why it kills the whole run

  • The dkg image is FROM ${QLEVER_IMAGE} and runs qlever-server native, in the same container as the Node pipeline (NativeTaskRunner). One cgroup, one memory limit.
  • On Kubernetes the container OOM (with memory.oom.group=1) kills all processes in the container cgroup atomically → Node dies too → backoffLimit reached → run aborts.
  • Locally (Docker, oom.group=0) the OOM killed only qlever-server; Node survived, caught the subsequent fetch failed errors per stage, and the run completed (exit 0) with a partial summary for that dataset. That difference (blast radius) is the crux.
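The local behaviour (each stage's failure is caught, the run continues, the summary ends up partial) can be sketched as follows. This is an illustration of the pattern only, not the actual @lde/pipeline code; Stage, StageResult, runStages, and summarize are hypothetical names.

```typescript
// Outcome of one analysis stage: either a row count or an error message.
type StageResult =
  | { stage: string; ok: true; rows: number }
  | { stage: string; ok: false; error: string };

// Hypothetical stage shape: resolves with a row count, rejects on failure
// (for example with "fetch failed" when the SPARQL endpoint has died).
type Stage = { name: string; run: () => Promise<number> };

// Run stages one after another; a failing stage is recorded, not rethrown,
// so later stages and later datasets still get their turn.
async function runStages(stages: Stage[]): Promise<StageResult[]> {
  const results: StageResult[] = [];
  for (const stage of stages) {
    try {
      results.push({ stage: stage.name, ok: true, rows: await stage.run() });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ stage: stage.name, ok: false, error: message });
    }
  }
  return results;
}

// Condense results into a (possibly partial) summary.
function summarize(results: StageResult[]): { completed: number; failed: string[] } {
  return {
    completed: results.filter((r) => r.ok).length,
    failed: results.filter((r) => !r.ok).map((r) => r.stage),
  };
}
```

This only helps while the Node process survives. With memory.oom.group=1 the process running this loop is killed together with qlever-server, so no catch block ever runs.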

Impact

One dataset that exceeds the memory budget aborts the entire pipeline run, losing all datasets scheduled after it. A memory-constrained Burstable pod (request < limit) is also a prime eviction/OOM target under node pressure, so the point of failure is non-deterministic.

Proposed directions

  1. Isolate QLever from the pipeline process so a QLever OOM cannot kill the run. One option: run qlever-server as a separate container/cgroup (sidecar) that the pipeline talks to over the network, with a shared volume for index files and a small supervisor that (re)loads the per-dataset index. This requires a new @lde/sparql-qlever topology (today the importer and server share one TaskRunner in one process). Net effect: an over-budget dataset degrades to a partial summary instead of killing the run.
  2. Bound per-dataset memory for large datasets — e.g. cap the triple count / sample, or stream/batch the heavy analysis queries so peak memory stays within a configurable budget. This is the path to actually completing huge datasets within a fixed limit.
  3. Document the memory-max-size failure-mode behaviour — setting it below the container limit converts OOM kills into catchable HTTP 500 aborts (resilience), even though it does not lower the peak. Useful as an interim mitigation.
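As an illustration of the interim mitigation in item 3, the sketch below derives a memory-max-size value that stays below the container limit. The function name memoryMaxSizeFor and the headroom figure are assumptions; the measurements above only cover caps of 8G and 10G, and whether QLever reads "G" as GB or GiB is not checked here, so the value is rounded down to stay conservative.

```typescript
const GIB = 1024 ** 3;

// Hypothetical helper: choose a memory-max-size value strictly below the
// container limit, leaving headroom for Node and for QLever memory that the
// cap does not cover. Rounds down to whole units and formats like "12G".
function memoryMaxSizeFor(containerLimitBytes: number, headroomBytes: number): string {
  const budget = containerLimitBytes - headroomBytes;
  if (budget < GIB) {
    throw new RangeError("container limit leaves less than 1 GiB for QLever");
  }
  return `${Math.floor(budget / GIB)}G`;
}

// A 16 GiB container with an assumed 4 GiB of headroom.
console.log(memoryMaxSizeFor(16 * GIB, 4 * GIB)); // "12G"
```

This does not lower the peak. It only aims to keep the failure in the catchable HTTP 500 form, and the right amount of headroom would have to be found by measurement.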

Environment

  • @lde/sparql-qlever with QLever a14e0a0, native mode.
  • @lde/pipeline analysis stages (@lde/pipeline-void, @lde/pipeline-shacl-sampler).
  • 16 GiB container (Kubernetes), Node peak RSS ~0.3 GiB (the memory is QLever’s).
