
Add extend_single_year_dataset for fast dataset year projection#7700

Open
anth-volk wants to merge 5 commits into main from add-extend-single-year-dataset

Conversation


@anth-volk anth-volk commented Mar 4, 2026

Fixes #7699

Why this is needed

The API v2 alpha and the policyengine Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through sim.calculate() via create_datasets() — a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: policyengine-uk-data publishes entity-level HDFStore files directly, and policyengine-uk has extend_single_year_dataset() which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames — no simulation engine involved. This PR brings the same capability to the US.

How it works

Dataset schema classes (dataset_schema.py)

USSingleYearDataset holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a time_period. It can load from / save to Pandas HDFStore files, and provides .copy() for deep-copying all DataFrames.
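A minimal sketch of the shape described above (class and attribute names are assumed from this description, not copied from dataset_schema.py):

```python
import pandas as pd

# Entity tables as listed in the PR description; the real class may
# differ in detail (e.g. it may store these as named attributes).
ENTITIES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

class USSingleYearDataset:
    def __init__(self, entities: dict, time_period: int):
        self.entities = entities  # {entity_name: DataFrame}
        self.time_period = time_period

    def copy(self):
        # Deep-copy every entity table so mutating the copy cannot
        # affect the original dataset.
        return USSingleYearDataset(
            {name: df.copy(deep=True) for name, df in self.entities.items()},
            self.time_period,
        )

base = USSingleYearDataset(
    {"person": pd.DataFrame({"age": [30, 45]})}, time_period=2024
)
clone = base.copy()
clone.entities["person"].loc[0, "age"] = 99  # original stays untouched
```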

USMultiYearDataset wraps a dict[int, USSingleYearDataset] keyed by year. Its .load() returns data in {variable: {year: array}} format (time_period_arrays), which is what policyengine-core's Microsimulation expects for multi-year datasets.
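The {variable: {year: array}} pivot can be illustrated with a small stand-alone sketch (to_time_period_arrays is a hypothetical helper, not the actual .load() implementation):

```python
import numpy as np

# Two single-year tables, 2024 and 2025, each carrying one variable.
years = {
    2024: {"person": {"employment_income": np.array([30_000.0, 50_000.0])}},
    2025: {"person": {"employment_income": np.array([31_000.0, 51_600.0])}},
}

def to_time_period_arrays(datasets):
    # Pivot {year: {entity: {var: array}}} into {var: {year: array}},
    # the multi-year shape policyengine-core's Microsimulation expects.
    out = {}
    for year, entities in datasets.items():
        for table in entities.values():
            for var, values in table.items():
                out.setdefault(var, {})[year] = values
    return out

arrays = to_time_period_arrays(years)
```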

Uprating logic (economic_assumptions.py)

extend_single_year_dataset(dataset, end_year=2035) takes a single base-year dataset and produces a multi-year dataset by:

  1. Copying the base-year DataFrames for each year from base_year through end_year
  2. Applying multiplicative uprating year-over-year: for each variable column, it looks up system.variables[var].uprating to get a dotted parameter path (e.g. "calibration.gov.irs.soi.employment_income"), resolves it against system.parameters, and computes factor = param(current_year) / param(previous_year). The column values are then multiplied by that factor.
  3. Carrying forward variables without an uprating parameter unchanged (e.g. age, entity IDs).
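The year-over-year step above can be sketched as follows, with PARAM_VALUES standing in for a resolved calibration parameter (a hypothetical stand-in for system.parameters, not the real lookup):

```python
import pandas as pd

# Hypothetical parameter values per year, e.g. for
# "calibration.gov.irs.soi.employment_income".
PARAM_VALUES = {2024: 1.00, 2025: 1.03, 2026: 1.0609}

def uprate_column(values: pd.Series, current_year: int) -> pd.Series:
    # Multiplicative uprating: scale by the year-over-year ratio of the
    # parameter, so repeated application compounds from year to year.
    factor = PARAM_VALUES[current_year] / PARAM_VALUES[current_year - 1]
    return values * factor

base = pd.Series([30_000.0, 50_000.0])
y1 = uprate_column(base, 2025)  # base * 1.03
y2 = uprate_column(y1, 2026)    # compounds: base * 1.03 * 1.03
```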

This is the same approach used by policyengine-uk. The uprating mapping is derived entirely from system.variables at runtime — the 62 variables with explicit uprating = "..." and the 108 variables assigned via default_uprating.py are all picked up automatically. No separate list to maintain.

Dual-path loading (system.py)

Microsimulation.__init__ now auto-detects dataset format before calling super().__init__():

  • HDFStore format (entity names like person, household as top-level HDF5 keys): loads as USSingleYearDataset, extends via extend_single_year_dataset(), and passes the resulting USMultiYearDataset to policyengine-core.
  • Legacy h5py format (variable names as top-level keys): falls through to the existing CoreMicrosimulation code path, unchanged.

Format detection (_is_hdfstore_format) inspects the top-level HDF5 keys — entity names indicate HDFStore, variable names indicate h5py.
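The detection rule can be sketched like this (looks_like_hdfstore is an illustrative name; the real _is_hdfstore_format reads the keys from the file itself):

```python
# Entity names that mark the HDFStore layout; anything else is assumed
# to be the legacy variable-keyed h5py layout.
ENTITY_KEYS = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys) -> bool:
    # HDFStore keys arrive with leading slashes ("/person"), so strip
    # them before intersecting with the known entity names.
    keys = {k.strip("/") for k in top_level_keys}
    return bool(keys & ENTITY_KEYS)
```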

How we verify correctness

Unit tests (22 tests, ~0.3s)

The test suite in tests/microsimulation/data/ uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

  • _resolve_parameter (3 tests): valid dotted path, invalid path, partially valid path
  • _apply_single_year_uprating (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
  • extend_single_year_dataset (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)
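The mock-system approach can be sketched like this (make_mock_system and apply_uprating are illustrative stand-ins for the suite's fixtures and _apply_single_year_uprating, not the actual test code):

```python
from types import SimpleNamespace
import pandas as pd

def make_mock_system():
    # Fake parameter tree and variable registry, so no full
    # tax-benefit system has to load.
    params = {"calibration.cpi": {2024: 1.0, 2025: 1.02}}
    variables = {
        "employment_income": SimpleNamespace(uprating="calibration.cpi"),
        "age": SimpleNamespace(uprating=None),
    }
    return SimpleNamespace(parameters=params, variables=variables)

def apply_uprating(df: pd.DataFrame, system, year: int) -> pd.DataFrame:
    out = df.copy()
    for col in df.columns:
        var = system.variables.get(col)
        if var is None or var.uprating is None:
            continue  # non-uprated columns carried forward unchanged
        series = system.parameters[var.uprating]
        prev = series[year - 1]
        if prev == 0:
            continue  # division-by-zero guard
        out[col] = df[col] * (series[year] / prev)
    return out

df = pd.DataFrame({"employment_income": [100.0], "age": [40]})
result = apply_uprating(df, make_mock_system(), 2025)
```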

Roundtrip validation (policyengine-us-data PR #568)

A separate one-off validation script in policyengine-us-data reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on

Test plan

  • make test-other passes (runs the 22 unit tests via pytest)
  • Load an HDFStore file via Microsimulation(dataset="path/to/STATE.hdfstore.h5") — verify it loads and extends correctly
  • Load a legacy h5py file via Microsimulation(dataset="path/to/STATE.h5") — verify existing path still works
  • Verify uprated variables (e.g. employment_income) grow year-over-year
  • Verify non-uprated variables (e.g. age) are carried forward unchanged

🤖 Generated with Claude Code

anth-volk and others added 5 commits March 4, 2026 21:39
Adds USSingleYearDataset and USMultiYearDataset schema classes,
extend_single_year_dataset() with multiplicative uprating from the
parameter tree, and dual-path loading in Microsimulation that
auto-detects entity-level HDFStore files and extends them without
routing through the simulation engine.

Legacy h5py files continue to work via the existing code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22 tests covering _resolve_parameter, _apply_single_year_uprating,
and end-to-end extend_single_year_dataset. Uses mock system objects
to avoid loading the full tax-benefit system (~0.3s total runtime).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PavelMakarchuk
Collaborator

PR Review

🔴 Critical (Must Fix)

1. USMultiYearDataset.__init__ uses if/if instead of if/elif — double-processing bug
dataset_schema.py:175-201

If both datasets and file_path are provided, both branches execute and file_path silently overwrites self.datasets. This should be elif. Also, if neither is provided, self.datasets is never set, causing an AttributeError on line 204.
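A sketch of the suggested fix (the constructor shape is assumed from the review, not copied from the file):

```python
class USMultiYearDataset:
    # elif makes the two inputs mutually exclusive, and a final else
    # catches the no-input case instead of leaving self.datasets unset.
    def __init__(self, datasets=None, file_path=None):
        if datasets is not None:
            self.datasets = dict(datasets)
        elif file_path is not None:
            self.datasets = self._load(file_path)
        else:
            raise ValueError("Provide either datasets or file_path")

    def _load(self, file_path):
        # Placeholder for the real HDFStore loading logic.
        raise NotImplementedError

ds = USMultiYearDataset(datasets={2024: "base"})
```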

2. _is_hdfstore_format may not work correctly with actual HDFStore files
system.py:218-239

HDFStore (PyTables) files accessed via h5py expose a different key structure than pd.HDFStore.keys(). Consider using pd.HDFStore directly for detection:

```python
with pd.HDFStore(file_path, mode="r") as store:
    return bool(entity_names & {k.strip("/") for k in store.keys()})
```

3. No handling of USMultiYearDataset passed directly to Microsimulation
system.py:287-308

The dual-path detection handles str and USSingleYearDataset but not USMultiYearDataset. If a caller passes an already-extended multi-year dataset, it falls through to super().__init__() unhandled.


🟡 Should Address

4. validate_file_path validates with h5py but loads with pd.HDFStore
dataset_schema.py:45-68 vs 84-94 — Using different libraries for validation vs loading could cause mismatches. Use the same library for both.

5. _resolve_dataset_path returns None silently for non-HF, non-existent paths
system.py:199-215 — A typo'd path like "data/staet.h5" returns None, skips HDFStore check, and passes the string to super().__init__() producing a confusing error. Consider raising FileNotFoundError early.
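A sketch of the fail-fast behaviour suggested here (the "hf://" Hugging Face prefix is an assumption about how remote datasets are addressed; the real _resolve_dataset_path may differ):

```python
from pathlib import Path

def resolve_dataset_path(path: str) -> str:
    # Raise immediately on a mistyped local path instead of returning
    # None and producing a confusing downstream error.
    if path.startswith("hf://"):
        return path  # remote dataset: leave resolution to the loader
    if not Path(path).exists():
        raise FileNotFoundError(f"Dataset not found: {path}")
    return path
```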

6. Test mocking strategy is fragile
test_extend_single_year_dataset.py:736-760 — Direct sys.modules manipulation is thread-unsafe and can leak state. Use unittest.mock.patch.dict("sys.modules", ...) instead.
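A sketch of the suggested pattern (the module name below is hypothetical):

```python
import sys
from unittest.mock import MagicMock, patch

# patch.dict restores sys.modules even if the test body raises, unlike
# direct assignment into sys.modules.
fake_module = MagicMock()
with patch.dict(sys.modules, {"policyengine_us.heavy_module": fake_module}):
    assert sys.modules["policyengine_us.heavy_module"] is fake_module

# Outside the context manager, the stub is gone again.
leaked = "policyengine_us.heavy_module" in sys.modules
```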

7. No tests for file I/O paths
The save() / load() / file-based __init__ for both USSingleYearDataset and USMultiYearDataset are untested — these are the paths used in production.

8. USSingleYearDataset.load() may produce duplicate keys across entities
dataset_schema.py:142-147 — If two entities share a column name, the second silently overwrites the first in the returned dict.
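A sketch of one possible guard (merge_entity_columns is an illustrative name, not the actual load() code):

```python
def merge_entity_columns(entity_frames: dict) -> dict:
    # Detect column-name collisions across entity tables instead of
    # letting a later entity silently overwrite an earlier one.
    merged = {}
    for entity, columns in entity_frames.items():
        for name, values in columns.items():
            if name in merged:
                raise ValueError(
                    f"Duplicate variable {name!r} (second seen in {entity!r})"
                )
            merged[name] = values
    return merged
```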


🟢 Suggestions

  • Changelog fragment is very long — consider shortening to "Add extend_single_year_dataset for fast multi-year dataset projection"
  • Consider adding __repr__ to dataset classes for easier debugging

Validation Summary

| Check | Result |
| --- | --- |
| Code Patterns | 3 critical issues |
| Test Coverage | 2 gaps (no file I/O tests, fragile mocking) |
| CI Status | No checks found |
| Architecture | Sound — mirrors policyengine-uk approach |
| Documentation | PR description is excellent |

Recommendation: Address the if/elif bug and HDFStore detection before merge. Core approach is solid.

To auto-fix issues: /fix-pr 7700
