
Add extend_single_year_dataset for fast dataset year projection#7700

Open
anth-volk wants to merge 5 commits into main from add-extend-single-year-dataset

Conversation


@anth-volk anth-volk commented Mar 4, 2026

Fixes #7699

Why this is needed

The API v2 alpha and the policyengine Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through sim.calculate() via create_datasets() — a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: policyengine-uk-data publishes entity-level HDFStore files directly, and policyengine-uk has extend_single_year_dataset() which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames — no simulation engine involved. This PR brings the same capability to the US.

How it works

Dataset schema classes (dataset_schema.py)

USSingleYearDataset holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a time_period. It can load from / save to Pandas HDFStore files, and provides .copy() for deep-copying all DataFrames.
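A minimal sketch of the shape described above (class and attribute names are assumed from this description, not copied from dataset_schema.py):

```python
import pandas as pd

# Entity tables as listed in the PR description; the real class may
# differ in detail (e.g. it may store these as named attributes).
ENTITIES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

class USSingleYearDataset:
    def __init__(self, entities: dict, time_period: int):
        self.entities = entities  # {entity_name: DataFrame}
        self.time_period = time_period

    def copy(self):
        # Deep-copy every entity table so mutating the copy cannot
        # affect the original dataset.
        return USSingleYearDataset(
            {name: df.copy(deep=True) for name, df in self.entities.items()},
            self.time_period,
        )

base = USSingleYearDataset(
    {"person": pd.DataFrame({"age": [30, 45]})}, time_period=2024
)
clone = base.copy()
clone.entities["person"].loc[0, "age"] = 99  # original stays untouched
```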

USMultiYearDataset wraps a dict[int, USSingleYearDataset] keyed by year. Its .load() returns data in {variable: {year: array}} format (time_period_arrays), which is what policyengine-core's Microsimulation expects for multi-year datasets.
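The {variable: {year: array}} pivot can be illustrated with a small stand-alone sketch (to_time_period_arrays is a hypothetical helper, not the actual .load() implementation):

```python
import numpy as np

# Two single-year tables, 2024 and 2025, each carrying one variable.
years = {
    2024: {"person": {"employment_income": np.array([30_000.0, 50_000.0])}},
    2025: {"person": {"employment_income": np.array([31_000.0, 51_600.0])}},
}

def to_time_period_arrays(datasets):
    # Pivot {year: {entity: {var: array}}} into {var: {year: array}},
    # the multi-year shape policyengine-core's Microsimulation expects.
    out = {}
    for year, entities in datasets.items():
        for table in entities.values():
            for var, values in table.items():
                out.setdefault(var, {})[year] = values
    return out

arrays = to_time_period_arrays(years)
```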

Uprating logic (economic_assumptions.py)

extend_single_year_dataset(dataset, end_year=2035) takes a single base-year dataset and produces a multi-year dataset by:

  1. Copying the base-year DataFrames for each year from base_year through end_year
  2. Applying multiplicative uprating year-over-year: for each variable column, it looks up system.variables[var].uprating to get a dotted parameter path (e.g. "calibration.gov.irs.soi.employment_income"), resolves it against system.parameters, and computes factor = param(current_year) / param(previous_year). The column values are then multiplied by that factor.
  3. Carrying forward variables without an uprating parameter unchanged (e.g. age, entity IDs).
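The year-over-year step above can be sketched as follows, with PARAM_VALUES standing in for a resolved calibration parameter (a hypothetical stand-in for system.parameters, not the real lookup):

```python
import pandas as pd

# Hypothetical parameter values per year, e.g. for
# "calibration.gov.irs.soi.employment_income".
PARAM_VALUES = {2024: 1.00, 2025: 1.03, 2026: 1.0609}

def uprate_column(values: pd.Series, current_year: int) -> pd.Series:
    # Multiplicative uprating: scale by the year-over-year ratio of the
    # parameter, so repeated application compounds from year to year.
    factor = PARAM_VALUES[current_year] / PARAM_VALUES[current_year - 1]
    return values * factor

base = pd.Series([30_000.0, 50_000.0])
y1 = uprate_column(base, 2025)  # base * 1.03
y2 = uprate_column(y1, 2026)    # compounds: base * 1.03 * 1.03
```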

This is the same approach used by policyengine-uk. The uprating mapping is derived entirely from system.variables at runtime — the 62 variables with explicit uprating = "..." and the 108 variables assigned via default_uprating.py are all picked up automatically. No separate list to maintain.

Dual-path loading (system.py)

Microsimulation.__init__ now auto-detects dataset format before calling super().__init__():

  • HDFStore format (entity names like person, household as top-level HDF5 keys): loads as USSingleYearDataset, extends via extend_single_year_dataset(), and passes the resulting USMultiYearDataset to policyengine-core.
  • Legacy h5py format (variable names as top-level keys): falls through to the existing CoreMicrosimulation code path, unchanged.

Format detection (_is_hdfstore_format) inspects the top-level HDF5 keys — entity names indicate HDFStore, variable names indicate h5py.
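The detection rule can be sketched like this (looks_like_hdfstore is an illustrative name; the real _is_hdfstore_format reads the keys from the file itself):

```python
# Entity names that mark the HDFStore layout; anything else is assumed
# to be the legacy variable-keyed h5py layout.
ENTITY_KEYS = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys) -> bool:
    # HDFStore keys arrive with leading slashes ("/person"), so strip
    # them before intersecting with the known entity names.
    keys = {k.strip("/") for k in top_level_keys}
    return bool(keys & ENTITY_KEYS)
```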

How we verify correctness

Unit tests (22 tests, ~0.3s)

The test suite in tests/microsimulation/data/ uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

  • _resolve_parameter (3 tests): valid dotted path, invalid path, partially valid path
  • _apply_single_year_uprating (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
  • extend_single_year_dataset (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)
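The mock-system approach can be sketched like this (make_mock_system and apply_uprating are illustrative stand-ins for the suite's fixtures and _apply_single_year_uprating, not the actual test code):

```python
from types import SimpleNamespace
import pandas as pd

def make_mock_system():
    # Fake parameter tree and variable registry, so no full
    # tax-benefit system has to load.
    params = {"calibration.cpi": {2024: 1.0, 2025: 1.02}}
    variables = {
        "employment_income": SimpleNamespace(uprating="calibration.cpi"),
        "age": SimpleNamespace(uprating=None),
    }
    return SimpleNamespace(parameters=params, variables=variables)

def apply_uprating(df: pd.DataFrame, system, year: int) -> pd.DataFrame:
    out = df.copy()
    for col in df.columns:
        var = system.variables.get(col)
        if var is None or var.uprating is None:
            continue  # non-uprated columns carried forward unchanged
        series = system.parameters[var.uprating]
        prev = series[year - 1]
        if prev == 0:
            continue  # division-by-zero guard
        out[col] = df[col] * (series[year] / prev)
    return out

df = pd.DataFrame({"employment_income": [100.0], "age": [40]})
result = apply_uprating(df, make_mock_system(), 2025)
```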

Roundtrip validation (policyengine-us-data PR #568)

A separate one-off validation script in policyengine-us-data reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on

Test plan

  • make test-other passes (runs the 22 unit tests via pytest)
  • Load an HDFStore file via Microsimulation(dataset="path/to/STATE.hdfstore.h5") — verify it loads and extends correctly
  • Load a legacy h5py file via Microsimulation(dataset="path/to/STATE.h5") — verify existing path still works
  • Verify uprated variables (e.g. employment_income) grow year-over-year
  • Verify non-uprated variables (e.g. age) are carried forward unchanged

🤖 Generated with Claude Code

anth-volk and others added 5 commits March 4, 2026 21:39
Adds USSingleYearDataset and USMultiYearDataset schema classes,
extend_single_year_dataset() with multiplicative uprating from the
parameter tree, and dual-path loading in Microsimulation that
auto-detects entity-level HDFStore files and extends them without
routing through the simulation engine.

Legacy h5py files continue to work via the existing code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22 tests covering _resolve_parameter, _apply_single_year_uprating,
and end-to-end extend_single_year_dataset. Uses mock system objects
to avoid loading the full tax-benefit system (~0.3s total runtime).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PavelMakarchuk
Collaborator

PR Review

🔴 Critical (Must Fix)

1. USMultiYearDataset.__init__ uses if/if instead of if/elif — double-processing bug
dataset_schema.py:175-201

If both datasets and file_path are provided, both branches execute and file_path silently overwrites self.datasets. This should be elif. Also, if neither is provided, self.datasets is never set, causing an AttributeError on line 204.
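A sketch of the suggested fix (the constructor shape is assumed from the review, not copied from the file):

```python
class USMultiYearDataset:
    # elif makes the two inputs mutually exclusive, and a final else
    # catches the no-input case instead of leaving self.datasets unset.
    def __init__(self, datasets=None, file_path=None):
        if datasets is not None:
            self.datasets = dict(datasets)
        elif file_path is not None:
            self.datasets = self._load(file_path)
        else:
            raise ValueError("Provide either datasets or file_path")

    def _load(self, file_path):
        # Placeholder for the real HDFStore loading logic.
        raise NotImplementedError

ds = USMultiYearDataset(datasets={2024: "base"})
```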

2. _is_hdfstore_format may not work correctly with actual HDFStore files
system.py:218-239

HDFStore (PyTables) files accessed via h5py expose a different key structure than pd.HDFStore.keys(). Consider using pd.HDFStore directly for detection:

```python
with pd.HDFStore(file_path, mode="r") as store:
    return bool(entity_names & {k.strip("/") for k in store.keys()})
```

3. No handling of USMultiYearDataset passed directly to Microsimulation
system.py:287-308

The dual-path detection handles str and USSingleYearDataset but not USMultiYearDataset. If a caller passes an already-extended multi-year dataset, it falls through to super().__init__() unhandled.


🟡 Should Address

4. validate_file_path validates with h5py but loads with pd.HDFStore
dataset_schema.py:45-68 vs 84-94 — Using different libraries for validation vs loading could cause mismatches. Use the same library for both.

5. _resolve_dataset_path returns None silently for non-HF, non-existent paths
system.py:199-215 — A typo'd path like "data/staet.h5" returns None, skips HDFStore check, and passes the string to super().__init__() producing a confusing error. Consider raising FileNotFoundError early.
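A sketch of the fail-fast behaviour suggested here (the "hf://" Hugging Face prefix is an assumption about how remote datasets are addressed; the real _resolve_dataset_path may differ):

```python
from pathlib import Path

def resolve_dataset_path(path: str) -> str:
    # Raise immediately on a mistyped local path instead of returning
    # None and producing a confusing downstream error.
    if path.startswith("hf://"):
        return path  # remote dataset: leave resolution to the loader
    if not Path(path).exists():
        raise FileNotFoundError(f"Dataset not found: {path}")
    return path
```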

6. Test mocking strategy is fragile
test_extend_single_year_dataset.py:736-760 — Direct sys.modules manipulation is thread-unsafe and can leak state. Use unittest.mock.patch.dict("sys.modules", ...) instead.
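A sketch of the suggested pattern (the module name below is hypothetical):

```python
import sys
from unittest.mock import MagicMock, patch

# patch.dict restores sys.modules even if the test body raises, unlike
# direct assignment into sys.modules.
fake_module = MagicMock()
with patch.dict(sys.modules, {"policyengine_us.heavy_module": fake_module}):
    assert sys.modules["policyengine_us.heavy_module"] is fake_module

# Outside the context manager, the stub is gone again.
leaked = "policyengine_us.heavy_module" in sys.modules
```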

7. No tests for file I/O paths
The save() / load() / file-based __init__ for both USSingleYearDataset and USMultiYearDataset are untested — these are the paths used in production.

8. USSingleYearDataset.load() may produce duplicate keys across entities
dataset_schema.py:142-147 — If two entities share a column name, the second silently overwrites the first in the returned dict.
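A sketch of one possible guard (merge_entity_columns is an illustrative name, not the actual load() code):

```python
def merge_entity_columns(entity_frames: dict) -> dict:
    # Detect column-name collisions across entity tables instead of
    # letting a later entity silently overwrite an earlier one.
    merged = {}
    for entity, columns in entity_frames.items():
        for name, values in columns.items():
            if name in merged:
                raise ValueError(
                    f"Duplicate variable {name!r} (second seen in {entity!r})"
                )
            merged[name] = values
    return merged
```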


🟢 Suggestions

  • Changelog fragment is very long — consider shortening to "Add extend_single_year_dataset for fast multi-year dataset projection"
  • Consider adding __repr__ to dataset classes for easier debugging

Validation Summary

| Check | Result |
| --- | --- |
| Code Patterns | 3 critical issues |
| Test Coverage | 2 gaps (no file I/O tests, fragile mocking) |
| CI Status | No checks found |
| Architecture | Sound — mirrors policyengine-uk approach |
| Documentation | PR description is excellent |

Recommendation: Address the if/elif bug and HDFStore detection before merge. Core approach is solid.

To auto-fix issues: /fix-pr 7700
