
Add Polars-backed lazy parquet data structures with pluggable executor and materialization strategies #204

Draft

aoustry wants to merge 12 commits into main from claude/parquet-lazy-database-DloGS

Conversation

aoustry (Collaborator) commented Apr 30, 2026

Overview

Introduces a Polars-backed lazy loading path for time-series data stored as Parquet files, alongside a pluggable Executor for parallel subproblem execution and a MaterializationStrategy enum to control when and how much data is loaded into RAM.

New data structures (gems/study/data.py)

Three new AbstractDataStructure subclasses defer parquet I/O to get_value() call time:

| Class | Parquet layout | Polars optimisation |
| --- | --- | --- |
| LazyTimeSeriesData | 1 column "0", T rows | row indexing deferred |
| LazyScenarioSeriesData | S columns "0" … "S-1", 1 effective row | .head(1) + column pruning |
| LazyTimeScenarioSeriesData | S columns, T rows | column pruning — only requested scenarios read |
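
A minimal sketch of the column-pruning idea behind LazyTimeScenarioSeriesData, with simplified signatures (the real classes subclass AbstractDataStructure in gems/study/data.py and carry more state):

```python
import numpy as np
import polars as pl

class LazyTimeScenarioSeriesSketch:
    """Illustrative only; not the literal class from this PR."""

    def __init__(self, uri: str) -> None:
        self.uri = uri  # picklable payload is just the path string

    def get_value(self, timesteps: np.ndarray, scenarios: np.ndarray) -> np.ndarray:
        # Deduplicate requested columns (Polars rejects duplicates in .select),
        # keeping the inverse mapping to re-expand the result afterwards.
        unique_cols, inverse = np.unique(scenarios, return_inverse=True)
        frame = (
            pl.scan_parquet(self.uri)
            .select([str(c) for c in unique_cols])  # column pruning at scan time
            .collect()
        )
        data = frame.to_numpy()                     # shape (T, n_unique)
        return data[np.ix_(timesteps, inverse)]     # row slice, re-expanded columns
```

LazyScenarioSeriesData follows the same pattern but chains .head(1) before .collect(), so only the first row is read.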

A fourth class handles the materialized case:

  • MaterializedTimeScenarioSeriesData — holds a pre-loaded (T, n_cols) numpy array for a worker's scenario subset. Carries a pre-built dense lookup array so column remapping is a single vectorized numpy index rather than a Python loop.

All get_value implementations use np.unique(..., return_inverse=True) for duplicate-scenario deduplication, replacing the previous Python dict-loop pattern.
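
A schematic version of the materialized lookup, assuming simplified constructor arguments (the real class holds additional metadata):

```python
import numpy as np

class MaterializedSketch:
    """Schematic version of MaterializedTimeScenarioSeriesData's lookup trick."""

    def __init__(self, data: np.ndarray, scenario_ids: np.ndarray) -> None:
        self.data = data                                   # (T, n_cols) pre-loaded slice
        # Dense lookup: position i holds the column in `data` for scenario id i,
        # or -1 when the scenario was not materialized for this worker.
        self._lookup = np.full(scenario_ids.max() + 1, -1, dtype=np.int64)
        self._lookup[scenario_ids] = np.arange(len(scenario_ids))

    def get_value(self, timesteps: np.ndarray, scenarios: np.ndarray) -> np.ndarray:
        cols = self._lookup[scenarios]                     # one vectorized remap, no loop
        if (cols < 0).any():
            raise KeyError("scenario not materialized for this worker")
        return self.data[np.ix_(timesteps, cols)]
```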

DataBase.materialize(mc_scenario_ids)

Returns a new DataBase with every lazy entry replaced by in-memory data. Only the data-series columns actually needed by the given MC scenario IDs are read from each parquet file (via ScenarioBuilder when present). Eager entries are carried over unchanged. The original DataBase is never mutated.
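
Hypothetical usage (`study.database` and the ID list are illustrative):

```python
lazy_db = study.database                     # may contain lazy parquet entries
eager_db = lazy_db.materialize(mc_scenario_ids=[0, 3, 7])
# lazy_db is unchanged; eager_db holds in-memory data for scenarios 0, 3 and 7 only
```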

Parquet detection and backward compatibility

_build_data() checks for a .parquet file first; falls back transparently to the existing .txt / .tsv eager path when none is present. Both formats can coexist in the same directory — parquet always takes precedence.
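
The detection order, sketched with illustrative helper and path names (not the literal _build_data code):

```python
from pathlib import Path

def _resolve_series_file(series_dir: Path, name: str) -> Path:
    parquet = series_dir / f"{name}.parquet"
    if parquet.exists():
        return parquet                  # parquet always takes precedence
    for ext in (".txt", ".tsv"):        # eager fallback, behaviour unchanged
        candidate = series_dir / f"{name}{ext}"
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no data series found for {name!r}")
```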

MaterializationStrategy enum (gems/session/session.py)

Replaces the previous materialize_per_worker: bool field with a three-value enum:

| Value | Behaviour |
| --- | --- |
| NONE (default) | Lazy I/O on every get_value call |
| UPFRONT | Full database loaded into RAM once before the run loop starts |
| PER_WORKER | Each scenario (sequential) or batch (parallel) materializes only its own columns |

UPFRONT is the right choice when data fits comfortably in RAM and I/O is expensive relative to compute (e.g. network-attached storage). PER_WORKER is the right choice for distributed workers where minimal per-worker RAM footprint matters. NONE is appropriate when parquet files are local and small relative to the time horizon.
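
The enum itself, sketched from the table above (member values in gems/session/session.py may differ):

```python
from enum import Enum, auto

class MaterializationStrategy(Enum):
    NONE = auto()        # lazy I/O on every get_value call (default)
    UPFRONT = auto()     # whole database pre-loaded before the run loop
    PER_WORKER = auto()  # each scenario/batch loads only its own columns
```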

Pluggable executor (SimulationSession)

SimulationSession gains an optional executor: concurrent.futures.Executor field:

  • Sequential mode — each scenario is submitted as an independent task; with executor=None they run in the calling thread.
  • Parallel mode — (block, scenario) work items are grouped into batches and submitted concurrently. _run_batch receives the (possibly pre-materialized) study as an explicit argument.

The lazy data structures are cheaply picklable (URI strings only), so remote workers (Ray, ProcessPoolExecutor) receive near-zero serialisation overhead.
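
Hypothetical wiring of a standard-library executor; the executor field and the strategy enum come from this PR, while the constructor shape and run() call are illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

from gems.session.session import MaterializationStrategy, SimulationSession

with ProcessPoolExecutor(max_workers=8) as pool:
    session = SimulationSession(
        study=study,                    # `study` assumed to be built elsewhere
        executor=pool,                  # None keeps the in-thread sequential behaviour
        materialization_strategy=MaterializationStrategy.PER_WORKER,
    )
    session.run()                       # illustrative entry point
```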

Thread safety (Study.__post_init__)

Study.model_components and Study.models are cached_property fields. Pre-populating them in __post_init__ ensures concurrent workers only ever perform reads, eliminating the race on cached_property's first write (the value is computed and stored in the instance __dict__ on first access).
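
A sketch of the pre-population trick with placeholder property bodies:

```python
from dataclasses import dataclass, field
from functools import cached_property

@dataclass
class StudySketch:
    components: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Resolve both caches on the constructing thread so worker threads
        # only ever read the stored value, never race on the first write
        # into the instance __dict__.
        _ = self.model_components
        _ = self.models

    @cached_property
    def model_components(self) -> dict:
        return {id(c): c for c in self.components}  # placeholder derivation

    @cached_property
    def models(self) -> set:
        return {type(c) for c in self.components}   # placeholder derivation
```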

Tests

  • tests/unittests/data/test_data.py — 13 new tests covering MaterializedTimeScenarioSeriesData (get_value, duplicates, missing keys, check_requirement) and DataBase.materialize (all lazy types, eager carry-over, immutability, selective columns, ScenarioBuilder mapping).
  • tests/e2e/functional/test_parquet_dataseries.py — 6 tests: one per lazy type, parquet-takes-precedence, PER_WORKER, and UPFRONT.
  • tests/e2e/functional/test_parquet_complex_study.py — 2 study-level comparison tests on the 7_4 study (12 components, 168 timesteps): parquet+tsv coexistence and parquet-only mode, both asserting numerical identity with the TSV reference run.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5

claude added 12 commits April 30, 2026 20:04
Introduces three lazy AbstractDataStructure subclasses (LazyTimeSeriesData,
LazyScenarioSeriesData, LazyTimeScenarioSeriesData) that defer parquet I/O
to get_value() call time using Polars column pruning — only the requested
scenario columns are read from disk, making each worker's payload proportional
to its (T_block, B_scenarios) slice rather than the full dataset.

Parquet format: one file per timeseries name in input/data-series/, columns
named by 0-based scenario index ("0", "1", …). _build_data() detects a
.parquet file first and returns the lazy variant; falls back to the existing
.txt/.tsv eager path when no parquet is present (full backward compatibility).

SimulationSession gains an optional executor field (concurrent.futures.Executor)
wired into _run_parallel(): when provided, all (block, scenario) work items are
submitted concurrently; None preserves the current sequential behaviour.
LazyDataBase is cheaply picklable (URIs only) so remote workers receive
near-zero serialisation overhead.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…tures

Four end-to-end tests in test_parquet_dataseries.py verify that all three
lazy parquet data structures produce the same optimization results as the
existing txt-based eager structures:
  - LazyTimeSeriesData (mirrors test_basic_balance_time_only_series)
  - LazyScenarioSeriesData (mirrors test_basic_balance_scenario_only_series)
  - LazyTimeScenarioSeriesData with ScenarioBuilder (mirrors test_system_with_scenarization)
  - Parquet-takes-precedence-over-txt backward-compatibility check

The e2e tests exposed a DuplicateError in LazyScenarioSeriesData and
LazyTimeScenarioSeriesData: when ScenarioBuilder maps multiple MC scenarios
to the same data-series column, cols contains duplicates which Polars rejects
in .select(). Fixed by deduplicating cols before the scan, then re-expanding
the result array to the original (possibly duplicated) column order.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Tests the full 7_4 study (12 components, 168 timesteps) with two scenarios:
parquet+TSV coexisting (parquet takes precedence) and parquet-only (no TSV
fallback). Both verify LazyTimeScenarioSeriesData is used and that simulation
tables are numerically identical to the TSV reference run.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…mode

Sequential mode: each scenario runs as an independent batch submitted to
the executor (time blocks within a scenario still run in order with
carry-over). Without an executor, behavior is unchanged.

Parallel mode: add blocks_per_batch (default 1) to ResolutionConfig so
multiple consecutive time blocks can be grouped into one executor submission.
Each block in a batch still solves its own independent LP problem; batching
only reduces dispatch overhead for remote workers.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
blocks_per_batch is not a config-file concern; callers who need non-default
batching pass it directly to _run_parallel(blocks_per_batch=N).

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
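
Illustrative call shape for the batching described in the commit above (the exact _run_parallel signature is assumed):

```python
session._run_parallel(blocks_per_batch=4)  # group 4 consecutive time blocks per submission
```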
DataBase.materialize(mc_scenario_ids) returns a copy of the database where
every lazy entry is pre-loaded from its parquet file. For LazyTimeScenario-
SeriesData, only the data-series columns that the given MC scenarios map to
are fetched (via ScenarioBuilder), backed by the new MaterializedTimeScenario-
SeriesData class which holds the loaded array and a col_pos remapping dict.

SimulationSession gains materialize_per_worker=False. When True, each worker
(_run_one_scenario_sequential and _run_batch) materializes a Study copy with
the pre-loaded database before its block loop, so each parquet file is read
exactly once per worker rather than once per get_value call.

_run_block gains an optional study parameter so callers can pass the
materialized Study without mutating the shared session state.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
1. Honest API: remove scenario_cols parameter from LazyTimeSeriesData.materialize()
   and LazyScenarioSeriesData.materialize() since neither benefits from column
   selectivity (single column / single row). DataBase.materialize() now dispatches
   per type: selective loading only for LazyTimeScenarioSeriesData.

2. Thread safety: add __post_init__ to Study that pre-populates both cached_property
   fields (model_components, models) at construction time so worker threads only
   ever perform reads, never a racy first-write.

3. Tests: 13 new tests covering MaterializedTimeScenarioSeriesData.get_value
   (basic, duplicate scenarios, error paths, check_requirement), DataBase.materialize()
   (each lazy type, eager carry-over, immutability, selective columns, ScenarioBuilder
   integration), and materialize_per_worker=True session behavior.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Uses .head(1) instead of .collect() to avoid loading all T rows when
only the first row is needed for a scenario-only series.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Replace Python dict-based dedup + list-comprehension re-expansion with
np.unique(..., return_inverse=True) in LazyScenarioSeriesData and
LazyTimeScenarioSeriesData.

Add a pre-built dense lookup array (_lookup) to
MaterializedTimeScenarioSeriesData so that col_pos lookups are fully
vectorized instead of a Python loop over the scenario array.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…y enum

Introduces three mutually exclusive strategies:
  NONE       – lazy I/O on every get_value call (previous default)
  UPFRONT    – full database loaded into RAM once before the run loop
  PER_WORKER – each scenario/batch materialises only its own columns

_run_batch now receives study as an explicit required parameter instead
of falling back to self.study, removing the optional-with-fallback smell.

Adds test_materialize_upfront_same_result_as_lazy to cover the new path.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
@aoustry aoustry marked this pull request as draft May 1, 2026 08:02
@aoustry aoustry changed the title Add Polars-backed lazy parquet data structures and pluggable executor Add Polars-backed lazy parquet data structures with pluggable executor and materialization strategies May 1, 2026