Add Polars-backed lazy parquet data structures with pluggable executor and materialization strategies#204
Draft
Introduces three lazy AbstractDataStructure subclasses (LazyTimeSeriesData,
LazyScenarioSeriesData, LazyTimeScenarioSeriesData) that defer parquet I/O
to get_value() call time using Polars column pruning — only the requested
scenario columns are read from disk, making each worker's payload proportional
to its (T_block, B_scenarios) slice rather than the full dataset.
Parquet format: one file per timeseries name in input/data-series/, columns
named by 0-based scenario index ("0", "1", …). _build_data() detects a
.parquet file first and returns the lazy variant; falls back to the existing
.txt/.tsv eager path when no parquet is present (full backward compatibility).
SimulationSession gains an optional executor field (concurrent.futures.Executor)
wired into _run_parallel(): when provided, all (block, scenario) work items are
submitted concurrently; None preserves the current sequential behaviour.
LazyDataBase is cheaply picklable (URIs only) so remote workers receive
near-zero serialisation overhead.
https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…tures

Four end-to-end tests in test_parquet_dataseries.py verify that all three lazy parquet data structures produce the same optimization results as the existing txt-based eager structures:

- LazyTimeSeriesData (mirrors test_basic_balance_time_only_series)
- LazyScenarioSeriesData (mirrors test_basic_balance_scenario_only_series)
- LazyTimeScenarioSeriesData with ScenarioBuilder (mirrors test_system_with_scenarization)
- Parquet-takes-precedence-over-txt backward-compatibility check

The e2e tests exposed a DuplicateError in LazyScenarioSeriesData and LazyTimeScenarioSeriesData: when ScenarioBuilder maps multiple MC scenarios to the same data-series column, cols contains duplicates, which Polars rejects in .select(). Fixed by deduplicating cols before the scan, then re-expanding the result array to the original (possibly duplicated) column order.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Tests the full 7_4 study (12 components, 168 timesteps) with two scenarios: parquet+TSV coexisting (parquet takes precedence) and parquet-only (no TSV fallback). Both verify LazyTimeScenarioSeriesData is used and that simulation tables are numerically identical to the TSV reference run. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
…mode

Sequential mode: each scenario runs as an independent batch submitted to the executor (time blocks within a scenario still run in order with carry-over). Without an executor, behavior is unchanged.

Parallel mode: add blocks_per_batch (default 1) to ResolutionConfig so multiple consecutive time blocks can be grouped into one executor submission. Each block in a batch still solves its own independent LP problem; batching only reduces dispatch overhead for remote workers.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
blocks_per_batch is not a config-file concern; callers who need non-default batching pass it directly to _run_parallel(blocks_per_batch=N). https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
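The batching idea above reduces to chunking consecutive blocks into single executor submissions. A minimal sketch, with `chunked` and `run_batch` as hypothetical helpers rather than the PR's actual functions:

```python
# Consecutive time blocks grouped into one submission each, cutting the
# per-item dispatch overhead for remote workers.
from concurrent.futures import ThreadPoolExecutor

def chunked(seq, size):
    """Group consecutive items into lists of at most `size`."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def run_batch(blocks):
    # Each block in the batch still solves its own independent problem;
    # only the dispatch to the executor is shared.
    return [b * b for b in blocks]

blocks = list(range(10))
blocks_per_batch = 3
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_batch, batch)
               for batch in chunked(blocks, blocks_per_batch)]
    results = [r for fut in futures for r in fut.result()]

assert results == [b * b for b in blocks]  # order preserved across batches
```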
DataBase.materialize(mc_scenario_ids) returns a copy of the database where every lazy entry is pre-loaded from its parquet file. For LazyTimeScenarioSeriesData, only the data-series columns that the given MC scenarios map to are fetched (via ScenarioBuilder), backed by the new MaterializedTimeScenarioSeriesData class, which holds the loaded array and a col_pos remapping dict.

SimulationSession gains materialize_per_worker=False. When True, each worker (_run_one_scenario_sequential and _run_batch) materializes a Study copy with the pre-loaded database before its block loop, so each parquet file is read exactly once per worker rather than once per get_value call.

_run_block gains an optional study parameter so callers can pass the materialized Study without mutating the shared session state.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
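The copy-with-preloaded-entries idea can be sketched in a few lines. The `Lazy`/`Materialized` classes and the dict-shaped database below are deliberately simplified stand-ins for the PR's `DataBase` machinery:

```python
# materialize() returns a *copy* in which lazy entries are loaded once;
# eager entries carry over unchanged and the original is never mutated.
import numpy as np

class Lazy:
    def __init__(self, loader):
        self.loader = loader  # deferred read, e.g. a parquet scan

    def materialize(self):
        return Materialized(self.loader())

class Materialized:
    def __init__(self, array):
        self.array = array  # pre-loaded data, no further I/O on access

def materialize_db(db):
    return {key: (val.materialize() if isinstance(val, Lazy) else val)
            for key, val in db.items()}

db = {"load": Lazy(lambda: np.arange(4)), "cost": np.ones(4)}
mat = materialize_db(db)

assert isinstance(db["load"], Lazy)           # original untouched
assert isinstance(mat["load"], Materialized)  # copy is pre-loaded
assert mat["cost"] is db["cost"]              # eager entry carried over
```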
1. Honest API: remove the scenario_cols parameter from LazyTimeSeriesData.materialize() and LazyScenarioSeriesData.materialize(), since neither benefits from column selectivity (single column / single row). DataBase.materialize() now dispatches per type: selective loading only for LazyTimeScenarioSeriesData.

2. Thread safety: add __post_init__ to Study that pre-populates both cached_property fields (model_components, models) at construction time, so worker threads only ever perform reads, never a racy first write.

3. Tests: 13 new tests covering MaterializedTimeScenarioSeriesData.get_value (basic, duplicate scenarios, error paths, check_requirement), DataBase.materialize() (each lazy type, eager carry-over, immutability, selective columns, ScenarioBuilder integration), and materialize_per_worker=True session behavior.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
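The thread-safety fix in point 2 is worth a standalone sketch: touching each `cached_property` inside `__post_init__` forces the one racy first write to happen at construction time, before any worker thread exists. `Study` here is a toy stand-in with made-up property bodies:

```python
# Pre-populating cached_property fields at construction time so concurrent
# workers only ever perform dictionary reads.
from dataclasses import dataclass
from functools import cached_property

@dataclass
class Study:
    names: list

    def __post_init__(self):
        # Accessing the properties writes their values into the instance
        # __dict__ now, eliminating a racy first write later.
        self.model_components
        self.models

    @cached_property
    def model_components(self):
        return {n: n.upper() for n in self.names}

    @cached_property
    def models(self):
        return sorted(self.names)

s = Study(["b", "a"])
# Both caches already populated before any thread can touch the object:
assert "model_components" in s.__dict__ and "models" in s.__dict__
```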
Uses .head(1) before collecting, rather than collecting the full frame, to avoid loading all T rows when only the first row is needed for a scenario-only series. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Replace Python dict-based dedup + list-comprehension re-expansion with np.unique(..., return_inverse=True) in LazyScenarioSeriesData and LazyTimeScenarioSeriesData. Add a pre-built dense lookup array (_lookup) to MaterializedTimeScenarioSeriesData so that col_pos lookups are fully vectorized instead of a Python loop over the scenario array. https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
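The dedup/re-expand trick reads each distinct column once and rebuilds the original (possibly duplicated) order from the inverse index. A self-contained numpy sketch with illustrative data:

```python
# np.unique(..., return_inverse=True) gives both the distinct columns to
# fetch and the index array that restores the original order.
import numpy as np

cols = np.array([2, 0, 2, 1, 0])   # scenario -> data-series column, with duplicates
uniq, inverse = np.unique(cols, return_inverse=True)
# uniq == [0, 1, 2] and cols == uniq[inverse] by construction, so the
# parquet scan only needs the three distinct columns.

data = np.arange(12).reshape(4, 3)  # stand-in for the deduped read: 4 timesteps x len(uniq) cols
expanded = data[:, inverse]         # vectorized re-expansion, no Python loop

assert np.array_equal(cols, uniq[inverse])
assert expanded.shape == (4, 5)
assert np.array_equal(expanded[:, 0], data[:, 2])  # first scenario maps to column 2
```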
…y enum

Introduces three mutually exclusive strategies:

- NONE – lazy I/O on every get_value call (previous default)
- UPFRONT – full database loaded into RAM once before the run loop
- PER_WORKER – each scenario/batch materialises only its own columns

_run_batch now receives study as an explicit required parameter instead of falling back to self.study, removing the optional-with-fallback smell.

Adds test_materialize_upfront_same_result_as_lazy to cover the new path.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
Overview
Introduces a Polars-backed lazy loading path for time-series data stored as Parquet files, alongside a pluggable `Executor` for parallel subproblem execution and a `MaterializationStrategy` enum to control when and how much data is loaded into RAM.

New data structures (`gems/study/data.py`)

Three new `AbstractDataStructure` subclasses defer parquet I/O to `get_value()` call time:

| Class | Columns read | Rows read |
| --- | --- | --- |
| `LazyTimeSeriesData` | `"0"` | T rows |
| `LazyScenarioSeriesData` | `"0"`…`"S-1"` | 1 effective row (`.head(1)` + column pruning) |
| `LazyTimeScenarioSeriesData` | requested scenario columns only | T rows (column pruning) |

A fourth class handles the materialized case: `MaterializedTimeScenarioSeriesData` — holds a pre-loaded `(T, n_cols)` numpy array for a worker's scenario subset. Carries a pre-built dense lookup array so column remapping is a single vectorized numpy index rather than a Python loop.

All `get_value` implementations use `np.unique(..., return_inverse=True)` for duplicate-scenario deduplication, replacing the previous Python dict-loop pattern.

`DataBase.materialize(mc_scenario_ids)`

Returns a new `DataBase` with every lazy entry replaced by in-memory data. Only the data-series columns actually needed by the given MC scenario IDs are read from each parquet file (via `ScenarioBuilder` when present). Eager entries are carried over unchanged. The original `DataBase` is never mutated.

Parquet detection and backward compatibility

`_build_data()` checks for a `.parquet` file first and falls back transparently to the existing `.txt`/`.tsv` eager path when none is present. Both formats can coexist in the same directory — parquet always takes precedence.

`MaterializationStrategy` enum (`gems/session/session.py`)

Replaces the previous `materialize_per_worker: bool` field with a three-value enum:

| Strategy | Behaviour |
| --- | --- |
| `NONE` (default) | lazy I/O on every `get_value` call |
| `UPFRONT` | full database loaded into RAM once before the run loop |
| `PER_WORKER` | each worker materialises only its own scenario columns |

`UPFRONT` is the right choice when data fits comfortably in RAM and I/O is expensive relative to compute (e.g. network-attached storage). `PER_WORKER` is the right choice for distributed workers where minimal per-worker RAM footprint matters. `NONE` is appropriate when parquet files are local and small relative to the time horizon.

Pluggable executor (`SimulationSession`)

`SimulationSession` gains an optional `executor: concurrent.futures.Executor` field: `None` runs work items in the calling thread; with an executor, `(block, scenario)` work items are grouped into batches and submitted concurrently. `_run_batch` receives the (possibly pre-materialized) study as an explicit argument. The lazy data structures are cheaply picklable (URI strings only), so remote workers (Ray, `ProcessPoolExecutor`) receive near-zero serialisation overhead.

Thread safety (`Study.__post_init__`)

`Study.model_components` and `Study.models` are `cached_property` fields. Pre-populating them in `__post_init__` ensures concurrent workers only ever perform reads, eliminating the racy first write through `cached_property`'s initial `__get__`.

Tests

- `tests/unittests/data/test_data.py` — 13 new tests covering `MaterializedTimeScenarioSeriesData` (`get_value`, duplicates, missing keys, `check_requirement`) and `DataBase.materialize` (all lazy types, eager carry-over, immutability, selective columns, `ScenarioBuilder` mapping).
- `tests/e2e/functional/test_parquet_dataseries.py` — 6 tests: one per lazy type, parquet-takes-precedence, `PER_WORKER`, and `UPFRONT`.
- `tests/e2e/functional/test_parquet_complex_study.py` — 2 study-level comparison tests on the 7_4 study (12 components, 168 timesteps): parquet+tsv coexistence and parquet-only mode, both asserting numerical identity with the TSV reference run.

https://claude.ai/code/session_01GYyht7zSYAB5m4V9mL2cB5
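The "cheaply picklable (URI strings only)" claim can be checked with a toy handle. `LazyHandle` below is an illustrative stand-in, not the PR's class; the point is that the pickled payload stays tiny regardless of how large the parquet file behind the URI is:

```python
# A lazy handle whose only state is a URI pickles to a few dozen bytes.
import pickle
from dataclasses import dataclass

@dataclass(frozen=True)
class LazyHandle:
    uri: str  # the only state that crosses the process boundary

handle = LazyHandle("input/data-series/load.parquet")
payload = pickle.dumps(handle)
restored = pickle.loads(payload)

assert restored == handle
assert len(payload) < 200  # serialisation cost independent of file contents
```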