Add Polars-backed lazy parquet data structures with pluggable executor and materialization strategies#204
Draft
Introduces three lazy AbstractDataStructure subclasses (LazyTimeSeriesData,
LazyScenarioSeriesData, LazyTimeScenarioSeriesData) that defer parquet I/O
to get_value() call time using Polars column pruning — only the requested
scenario columns are read from disk, making each worker's payload proportional
to its (T_block, B_scenarios) slice rather than the full dataset.
Parquet format: one file per timeseries name in input/data-series/, columns
named by 0-based scenario index ("0", "1", …). _build_data() detects a
.parquet file first and returns the lazy variant; falls back to the existing
.txt/.tsv eager path when no parquet is present (full backward compatibility).
SimulationSession gains an optional executor field (concurrent.futures.Executor)
wired into _run_parallel(): when provided, all (block, scenario) work items are
submitted concurrently; None preserves the current sequential behaviour.
LazyDataBase is cheaply picklable (URIs only) so remote workers receive
near-zero serialisation overhead.
https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…tures

Four end-to-end tests in test_parquet_dataseries.py verify that all three lazy parquet data structures produce the same optimization results as the existing txt-based eager structures:

- LazyTimeSeriesData (mirrors test_basic_balance_time_only_series)
- LazyScenarioSeriesData (mirrors test_basic_balance_scenario_only_series)
- LazyTimeScenarioSeriesData with ScenarioBuilder (mirrors test_system_with_scenarization)
- Parquet-takes-precedence-over-txt backward-compatibility check

The e2e tests exposed a DuplicateError in LazyScenarioSeriesData and LazyTimeScenarioSeriesData: when ScenarioBuilder maps multiple MC scenarios to the same data-series column, cols contains duplicates, which Polars rejects in .select(). Fixed by deduplicating cols before the scan, then re-expanding the result array to the original (possibly duplicated) column order.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Tests the full 7_4 study (12 components, 168 timesteps) with two scenarios: parquet+TSV coexisting (parquet takes precedence) and parquet-only (no TSV fallback). Both verify LazyTimeScenarioSeriesData is used and that simulation tables are numerically identical to the TSV reference run. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…mode

Sequential mode: each scenario runs as an independent batch submitted to the executor (time blocks within a scenario still run in order with carry-over). Without an executor, behavior is unchanged.

Parallel mode: add blocks_per_batch (default 1) to ResolutionConfig so multiple consecutive time blocks can be grouped into one executor submission. Each block in a batch still solves its own independent LP problem; batching only reduces dispatch overhead for remote workers.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
blocks_per_batch is not a config-file concern; callers who need non-default batching pass it directly to _run_parallel(blocks_per_batch=N). https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
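The batching idea above reduces to chunking consecutive blocks into single executor submissions. A minimal sketch, with `chunked` and `run_batch` as hypothetical helpers rather than the PR's actual functions:

```python
# Consecutive time blocks grouped into one submission each, cutting the
# per-item dispatch overhead for remote workers.
from concurrent.futures import ThreadPoolExecutor

def chunked(seq, size):
    """Group consecutive items into lists of at most `size`."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def run_batch(blocks):
    # Each block in the batch still solves its own independent problem;
    # only the dispatch to the executor is shared.
    return [b * b for b in blocks]

blocks = list(range(10))
blocks_per_batch = 3
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_batch, batch)
               for batch in chunked(blocks, blocks_per_batch)]
    results = [r for fut in futures for r in fut.result()]

assert results == [b * b for b in blocks]  # order preserved across batches
```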
DataBase.materialize(mc_scenario_ids) returns a copy of the database where every lazy entry is pre-loaded from its parquet file. For LazyTimeScenarioSeriesData, only the data-series columns that the given MC scenarios map to are fetched (via ScenarioBuilder), backed by the new MaterializedTimeScenarioSeriesData class, which holds the loaded array and a col_pos remapping dict.

SimulationSession gains materialize_per_worker=False. When True, each worker (_run_one_scenario_sequential and _run_batch) materializes a Study copy with the pre-loaded database before its block loop, so each parquet file is read exactly once per worker rather than once per get_value call.

_run_block gains an optional study parameter so callers can pass the materialized Study without mutating the shared session state.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
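The copy-with-preloaded-entries idea can be sketched in a few lines. The `Lazy`/`Materialized` classes and the dict-shaped database below are deliberately simplified stand-ins for the PR's `DataBase` machinery:

```python
# materialize() returns a *copy* in which lazy entries are loaded once;
# eager entries carry over unchanged and the original is never mutated.
import numpy as np

class Lazy:
    def __init__(self, loader):
        self.loader = loader  # deferred read, e.g. a parquet scan

    def materialize(self):
        return Materialized(self.loader())

class Materialized:
    def __init__(self, array):
        self.array = array  # pre-loaded data, no further I/O on access

def materialize_db(db):
    return {key: (val.materialize() if isinstance(val, Lazy) else val)
            for key, val in db.items()}

db = {"load": Lazy(lambda: np.arange(4)), "cost": np.ones(4)}
mat = materialize_db(db)

assert isinstance(db["load"], Lazy)           # original untouched
assert isinstance(mat["load"], Materialized)  # copy is pre-loaded
assert mat["cost"] is db["cost"]              # eager entry carried over
```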
1. Honest API: remove the scenario_cols parameter from LazyTimeSeriesData.materialize() and LazyScenarioSeriesData.materialize(), since neither benefits from column selectivity (single column / single row). DataBase.materialize() now dispatches per type: selective loading only for LazyTimeScenarioSeriesData.

2. Thread safety: add __post_init__ to Study that pre-populates both cached_property fields (model_components, models) at construction time, so worker threads only ever perform reads, never a racy first write.

3. Tests: 13 new tests covering MaterializedTimeScenarioSeriesData.get_value (basic, duplicate scenarios, error paths, check_requirement), DataBase.materialize() (each lazy type, eager carry-over, immutability, selective columns, ScenarioBuilder integration), and materialize_per_worker=True session behavior.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
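The thread-safety fix in point 2 is worth a standalone sketch: touching each `cached_property` inside `__post_init__` forces the one racy first write to happen at construction time, before any worker thread exists. `Study` here is a toy stand-in with made-up property bodies:

```python
# Pre-populating cached_property fields at construction time so concurrent
# workers only ever perform dictionary reads.
from dataclasses import dataclass
from functools import cached_property

@dataclass
class Study:
    names: list

    def __post_init__(self):
        # Accessing the properties writes their values into the instance
        # __dict__ now, eliminating a racy first write later.
        self.model_components
        self.models

    @cached_property
    def model_components(self):
        return {n: n.upper() for n in self.names}

    @cached_property
    def models(self):
        return sorted(self.names)

s = Study(["b", "a"])
# Both caches already populated before any thread can touch the object:
assert "model_components" in s.__dict__ and "models" in s.__dict__
```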
Uses .head(1) before collecting, rather than collecting the full frame, to avoid loading all T rows when only the first row is needed for a scenario-only series. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Replace Python dict-based dedup + list-comprehension re-expansion with np.unique(..., return_inverse=True) in LazyScenarioSeriesData and LazyTimeScenarioSeriesData. Add a pre-built dense lookup array (_lookup) to MaterializedTimeScenarioSeriesData so that col_pos lookups are fully vectorized instead of a Python loop over the scenario array. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
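The dedup/re-expand trick reads each distinct column once and rebuilds the original (possibly duplicated) order from the inverse index. A self-contained numpy sketch with illustrative data:

```python
# np.unique(..., return_inverse=True) gives both the distinct columns to
# fetch and the index array that restores the original order.
import numpy as np

cols = np.array([2, 0, 2, 1, 0])   # scenario -> data-series column, with duplicates
uniq, inverse = np.unique(cols, return_inverse=True)
# uniq == [0, 1, 2] and cols == uniq[inverse] by construction, so the
# parquet scan only needs the three distinct columns.

data = np.arange(12).reshape(4, 3)  # stand-in for the deduped read: 4 timesteps x len(uniq) cols
expanded = data[:, inverse]         # vectorized re-expansion, no Python loop

assert np.array_equal(cols, uniq[inverse])
assert expanded.shape == (4, 5)
assert np.array_equal(expanded[:, 0], data[:, 2])  # first scenario maps to column 2
```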
…y enum

Introduces three mutually exclusive strategies:

- NONE – lazy I/O on every get_value call (previous default)
- UPFRONT – full database loaded into RAM once before the run loop
- PER_WORKER – each scenario/batch materialises only its own columns

_run_batch now receives study as an explicit required parameter instead of falling back to self.study, removing the optional-with-fallback smell.

Adds test_materialize_upfront_same_result_as_lazy to cover the new path.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Overview
Introduces a Polars-backed lazy loading path for time-series data stored as Parquet files, alongside a pluggable `Executor` for parallel subproblem execution and a `MaterializationStrategy` enum to control when and how much data is loaded into RAM.

New data structures (`gems/study/data.py`)

Three new `AbstractDataStructure` subclasses defer parquet I/O to `get_value()` call time:

| Class | Columns read | Rows read |
| --- | --- | --- |
| `LazyTimeSeriesData` | `"0"` | T rows |
| `LazyScenarioSeriesData` | `"0"`…`"S-1"` | 1 effective row (`.head(1)` + column pruning) |
| `LazyTimeScenarioSeriesData` | requested scenario columns only | T rows (column pruning) |

A fourth class handles the materialized case: `MaterializedTimeScenarioSeriesData` — holds a pre-loaded `(T, n_cols)` numpy array for a worker's scenario subset. Carries a pre-built dense lookup array so column remapping is a single vectorized numpy index rather than a Python loop.

All `get_value` implementations use `np.unique(..., return_inverse=True)` for duplicate-scenario deduplication, replacing the previous Python dict-loop pattern.

`DataBase.materialize(mc_scenario_ids)`

Returns a new `DataBase` with every lazy entry replaced by in-memory data. Only the data-series columns actually needed by the given MC scenario IDs are read from each parquet file (via `ScenarioBuilder` when present). Eager entries are carried over unchanged. The original `DataBase` is never mutated.

Parquet detection and backward compatibility

`_build_data()` checks for a `.parquet` file first and falls back transparently to the existing `.txt`/`.tsv` eager path when none is present. Both formats can coexist in the same directory — parquet always takes precedence.

`MaterializationStrategy` enum (`gems/session/session.py`)

Replaces the previous `materialize_per_worker: bool` field with a three-value enum:

| Strategy | Behaviour |
| --- | --- |
| `NONE` (default) | lazy I/O on every `get_value` call |
| `UPFRONT` | full database loaded into RAM once before the run loop |
| `PER_WORKER` | each worker materialises only its own scenario columns |

`UPFRONT` is the right choice when data fits comfortably in RAM and I/O is expensive relative to compute (e.g. network-attached storage). `PER_WORKER` is the right choice for distributed workers where minimal per-worker RAM footprint matters. `NONE` is appropriate when parquet files are local and small relative to the time horizon.

Pluggable executor (`SimulationSession`)

`SimulationSession` gains an optional `executor: concurrent.futures.Executor` field: `None` runs work items in the calling thread; with an executor, `(block, scenario)` work items are grouped into batches and submitted concurrently. `_run_batch` receives the (possibly pre-materialized) study as an explicit argument. The lazy data structures are cheaply picklable (URI strings only), so remote workers (Ray, `ProcessPoolExecutor`) receive near-zero serialisation overhead.

Thread safety (`Study.__post_init__`)

`Study.model_components` and `Study.models` are `cached_property` fields. Pre-populating them in `__post_init__` ensures concurrent workers only ever perform reads, eliminating the racy first write through `cached_property`'s initial `__get__`.

Tests

- `tests/unittests/data/test_data.py` — 13 new tests covering `MaterializedTimeScenarioSeriesData` (`get_value`, duplicates, missing keys, `check_requirement`) and `DataBase.materialize` (all lazy types, eager carry-over, immutability, selective columns, `ScenarioBuilder` mapping).
- `tests/e2e/functional/test_parquet_dataseries.py` — 6 tests: one per lazy type, parquet-takes-precedence, `PER_WORKER`, and `UPFRONT`.
- `tests/e2e/functional/test_parquet_complex_study.py` — 2 study-level comparison tests on the 7_4 study (12 components, 168 timesteps): parquet+tsv coexistence and parquet-only mode, both asserting numerical identity with the TSV reference run.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
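The "cheaply picklable (URI strings only)" claim can be checked with a toy handle. `LazyHandle` below is an illustrative stand-in, not the PR's class; the point is that the pickled payload stays tiny regardless of how large the parquet file behind the URI is:

```python
# A lazy handle whose only state is a URI pickles to a few dozen bytes.
import pickle
from dataclasses import dataclass

@dataclass(frozen=True)
class LazyHandle:
    uri: str  # the only state that crosses the process boundary

handle = LazyHandle("input/data-series/load.parquet")
payload = pickle.dumps(handle)
restored = pickle.loads(payload)

assert restored == handle
assert len(payload) < 200  # serialisation cost independent of file contents
```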