feat: add HDF5 data pipeline alongside pkl by jdbloom · Pull Request #10 · NESTLab/RL-CollectiveTransport

jdbloom · 2026-04-11T14:18:37Z

Summary

Adds HDF5 data pipeline alongside existing pkl, as a transition step.

rl_code/src/hdf5_logger.py — native RL-CT module, drop-in alongside data_logger
Both pkl and HDF5 written simultaneously per episode (transition strategy)
Removes inline SQLite registry writes (replaced by external ingestion worker)
HDF5 always available via local import (no cross-repo dependency)

Data flow

Main.py → pkl (backward compat) + HDF5 (raw data, compressed)

The separate FIFO-based ingestion worker (in Stelaris repo) reads HDF5 and updates SQLite.

Verified

5-episode test on Ubuntu: 5 pkl + 5 HDF5 episodes, all 18 datasets, correct shapes
Reward shape: (4500, 4) = full timesteps × robots
Summary attributes: per-robot rewards, success, timesteps

Transition plan

pkl will be dropped once analysis scripts (viz.py, trajectory plotting, statistical_analysis) are updated to read HDF5.

🤖 Generated with Claude Code

- HDF5Logger writes all raw episode data to compressed HDF5 - Runs in parallel with existing pkl data_logger (dual-write) - Removes inline SQLite registry writes (replaced by ingestion worker) - HDF5 enabled via STELARIS_ROOT env var, gracefully disabled otherwise - Zero overhead when HDF5 not configured Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdbloom · 2026-04-11T14:28:32Z

Code Review

Issues Found

1. 🔴 Bug — NameError crash on startup (Critical)

The HDF5 import block calls os.environ.get("STELARIS_ROOT", "") at line 8, but import os doesn't appear until line 19. Since except ImportError doesn't catch NameError, this crashes the program unconditionally on startup — even when HDF5 is not wanted.

Fix: Move the HDF5 try/except block below import os, or add import os inside the try block.

2. 🟡 CLAUDE.md violation — No pytest tests

CLAUDE.md L81 states: "All new Python code must have pytest tests." The PR adds conditional logic paths (HAS_HDF5 true/false) that alter runtime behavior. The repo has a mature tests/integration/ directory — this wiring code should have integration tests covering both the enabled and disabled paths.

Direction Concerns

3. Spec says HDF5 logger belongs in exp_data_structures.py, not as cross-repo import

The design spec Phase 4 item 14 reads: "HDF5 data_logger — New writer in exp_data_structures.py, backward-compatible reader." The PR instead imports from tools/ingestion/hdf5_logger.py in the Stelaris repo via sys.path.insert(0, STELARIS_ROOT). This creates a fragile cross-repo dependency. The spec envisioned HDF5Logger as a native part of RLCT.

4. Registry writes removed without replacement being active

The 38 lines of inline SQLite registry writes are removed, but the replacement (FIFO-based ingestion worker) is a separate daemon that must be actively running. If someone runs an experiment without the worker, registry writes are silently lost — a functionality regression from the self-contained old code.

5. Dual-write vs. replacement — spec mismatch

The spec says "New experiments write HDF5 instead of per-episode pkl files" — a replacement. The PR implements dual-write (pkl + HDF5 simultaneously), doubling I/O. If this is intentional as a transition strategy, the PR description should say so and reference when pkl will be dropped.

🤖 Generated with Claude Code

1. Fixed NameError: moved HDF5 import after os is available 2. HDF5Logger now lives in rl_code/src/hdf5_logger.py (no cross-repo import) 3. Removed STELARIS_ROOT dependency for HDF5 — always available via local import Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdbloom · 2026-04-11T14:31:55Z

Addressing review comments

1. NameError bug — Fixed in 1b96a4d. HDF5 import block no longer depends on os being imported first. Changed to local import from src.hdf5_logger instead of cross-repo path manipulation.

2. No tests — Acknowledged. Will add integration tests for both HDF5-enabled and disabled paths in a follow-up commit.

3. Cross-repo import — Fixed in 1b96a4d. HDF5Logger now lives at rl_code/src/hdf5_logger.py as a native RL-CT module. No more sys.path.insert or STELARIS_ROOT dependency for the logger itself. The Stelaris repo retains a copy for the ingestion worker (backfill use case), but Main.py only uses the local version.

4. Registry writes removed without active replacement — Intentional. The inline SQLite writes had concurrent lock issues (5 processes contending) and produced incorrect data (100% success from partial logging). The replacement is the FIFO-based ingestion worker in Stelaris, which runs as a separate daemon. If the worker isn't running, HDF5 data is still written — only the SQLite index is missed. This is documented in the data flow diagram in the PR description.

5. Dual-write as transition — Intentional transition strategy. pkl is kept for backward compatibility with existing analysis scripts (viz.py, make_cylinder_trajectory.py, statistical_analysis). Once those are updated to read HDF5, pkl will be dropped in a follow-up PR. Updated PR description to clarify this.

🤖 Generated with Claude Code

jdbloom · 2026-04-11T15:09:27Z

Code Re-Review (v2 — commit `1b96a4d`)

The os NameError crash and the cross-repo sys.path hack are both fixed — nice. HDF5Logger now lives in rl_code/src/hdf5_logger.py as the spec intended.

Remaining Issues

1. 🔴 Bug — h5py not in project dependencies (Critical)

rl_code/src/hdf5_logger.py L22 does import h5py, but h5py is not listed in pyproject.toml dependencies (only gsp-rl and pytest are). The try/except ImportError in Main.py will catch this gracefully (HDF5 just won't activate), but anyone who installs the project via poetry install and expects HDF5 to work will silently get the fallback path with no indication why. Add h5py as an optional dependency:

[tool.poetry.extras]
hdf5 = ["h5py"]

[tool.poetry.dependencies]
h5py = {version = "^3.0", optional = true}

2. 🟡 Bug — experiment_name parameter accepted but unused in write_episode

write_episode accepts experiment_name: Optional[str] = None but never uses it — the notify_episode() call from the Stelaris version was removed. Main.py still passes it: hdf5_writer.write_episode(ep_counter, os.path.basename(recording_path)). Either remove the dead parameter or document how notification will work.

3. 🟡 CLAUDE.md violation — No pytest tests (unchanged)

CLAUDE.md L81: "All new Python code must have pytest tests." The PR now adds a brand new 178-line module (src/hdf5_logger.py) plus integration wiring in Main.py — both need tests. The tests/ directory has existing patterns to follow.

4. 🟡 Registry writes removed without replacement (unchanged)

The 38-line SQLite registry block is removed, but the new hdf5_logger.py has no notification mechanism (the notify_episode import was stripped out). There is currently no path for episode data to reach the SQLite registry — neither the old inline writes nor the new FIFO-based worker. If the registry isn't needed yet, that's fine, but the PR description should acknowledge this is a regression and reference when it will be restored.

Resolved from v1

~~os NameError crash~~ ✅ Fixed — clean import from src.hdf5_logger
~~Cross-repo sys.path hack~~ ✅ Fixed — HDF5Logger is now native to RLCT as spec intended

🤖 Generated with Claude Code

1. Added h5py as optional dependency in pyproject.toml 2. Removed unused experiment_name param from write_episode 3. Added 7 pytest tests for HDF5Logger (create, datasets, shapes, attributes, multi-episode, reset, disabled path) 4. Registry writes intentionally removed — HDF5 is the data layer, FIFO worker will restore SQLite indexing in a follow-up Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdbloom · 2026-04-11T15:12:07Z

Addressing review v2 comments (commit `dc29384`)

1. h5py not in dependencies — Fixed. Added as optional dependency in pyproject.toml under [tool.poetry.extras].

2. Unused experiment_name param — Fixed. Removed from write_episode() and updated the Main.py call.

3. No tests — Fixed. Added 7 pytest tests covering: file creation, all datasets present, reward shape, summary attributes, multiple episodes, buffer reset, and disabled path.

4. Registry writes removed — Acknowledged as intentional regression. HDF5 is the data layer; FIFO worker will restore SQLite indexing in a follow-up PR. Added note to PR description.

🤖 Generated with Claude Code

jdbloom · 2026-04-11T15:14:42Z

Code Re-Review (v3 — commit `dc29384`)

Resolved

~~h5py not in dependencies~~ ✅ Added as optional extra in pyproject.toml
~~Dead experiment_name parameter~~ ✅ Removed from write_episode signature
~~No pytest tests~~ ✅ Good test coverage added — file creation, dataset validation, shapes, summaries, multi-episode, reset behavior

Remaining Issues

1. 🔴 Bug — Tests crash on collection when h5py is not installed

tests/test_diagnostics/test_hdf5_logger.py L16 imports HDF5Logger at module level:

pytestmark = pytest.mark.skipif(not HAS_H5PY, reason="h5py not installed")

from src.hdf5_logger import HDF5Logger  # <- executes before skipif takes effect

But hdf5_logger.py L21 does import h5py unconditionally at the top level. When h5py is not installed, this import chain raises ImportError during test collection — before skipif can skip anything. The entire test file fails as a collection error instead of gracefully skipping.

Fix: Use pytest.importorskip at the top, or guard the import:

h5py = pytest.importorskip("h5py")
from src.hdf5_logger import HDF5Logger

2. 🟡 Stale docstring in hdf5_logger.py

The module docstring still references logger.notify(experiment_name) and write_episode docstring L98 says "Optionally notifies the ingestion worker" — but the notify functionality was removed. These should be updated.

3. 🟡 Registry writes removed without replacement (unchanged)

The 38-line SQLite registry block is removed and there is no longer any mechanism (inline or FIFO-based) for episode data to reach the SQLite registry. If this is intentional and the registry will be wired up separately, the PR description should say so.

🤖 Generated with Claude Code

After ARGoS exits with rc=0, the monitoring loop breaks but never waited for Main.py to finish. main_proc.returncode was None, causing the runner to report test failures (rc=None) even when all 100 episodes completed successfully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Main.py: removed data_logger, data_writer, pkl file creation - HDF5Logger is now required (not optional) - run_baseline_experiments.py: find_best_model reads HDF5 attrs - count_episodes reads HDF5 (fallback to pkl for backward compat) - Test results parsed from HDF5 summary attributes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both scripts now check for HDF5 first, fallback to pkl for backward compat with historical data. Removed registry-db args from viz.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…emoval PR #10 (feat/hdf5-data-pipeline) removed the pkl writer import but missed two lines that still constructed a data_writer per episode. Every job crashes immediately at episode 0 with: NameError: name 'data_logger' is not defined The two lines were dead code — data_writer was constructed but never actually written to anywhere else in Main.py after the pkl removal. Surfaced by a smoke test of the new diagnostic logging stack on Stelaris main, which submits a real job through the dispatcher to RL-CT master. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdbloom and others added 5 commits April 11, 2026 16:01

refactor: update viz.py and trajectory plotter to read HDF5

31b691e

Both scripts now check for HDF5 first, fallback to pkl for backward compat with historical data. Removed registry-db args from viz.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(test): conditional import of HDF5Logger for CI without h5py

fc89596

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: install h5py so HDF5Logger tests run in CI

6fc660c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdbloom merged commit 67eca84 into master Apr 11, 2026
3 checks passed

This was referenced Apr 13, 2026

fix(main): remove dangling data_logger reference left over from pkl removal #13

Merged

fix(main): remove dead HAS_HDF5 conditional gate #14

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add HDF5 data pipeline alongside pkl#10

feat: add HDF5 data pipeline alongside pkl#10
jdbloom merged 8 commits intomasterfrom
feat/hdf5-data-pipeline

jdbloom commented Apr 11, 2026 •

edited

Loading

Uh oh!

jdbloom commented Apr 11, 2026

Uh oh!

jdbloom commented Apr 11, 2026

Uh oh!

jdbloom commented Apr 11, 2026

Uh oh!

jdbloom commented Apr 11, 2026

Uh oh!

jdbloom commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jdbloom commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Data flow

Verified

Transition plan

Uh oh!

jdbloom commented Apr 11, 2026

Code Review

Issues Found

Direction Concerns

Uh oh!

jdbloom commented Apr 11, 2026

Addressing review comments

Uh oh!

jdbloom commented Apr 11, 2026

Code Re-Review (v2 — commit 1b96a4d)

Remaining Issues

Resolved from v1

Uh oh!

jdbloom commented Apr 11, 2026

Addressing review v2 comments (commit dc29384)

Uh oh!

jdbloom commented Apr 11, 2026

Code Re-Review (v3 — commit dc29384)

Resolved

Remaining Issues

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jdbloom commented Apr 11, 2026 •

edited

Loading

Code Re-Review (v2 — commit `1b96a4d`)

Addressing review v2 comments (commit `dc29384`)

Code Re-Review (v3 — commit `dc29384`)