# Add extend_single_year_dataset for fast dataset year projection #7700
Adds `USSingleYearDataset` and `USMultiYearDataset` schema classes, `extend_single_year_dataset()` with multiplicative uprating from the parameter tree, and dual-path loading in `Microsimulation` that auto-detects entity-level HDFStore files and extends them without routing through the simulation engine. Legacy h5py files continue to work via the existing code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22 tests covering `_resolve_parameter`, `_apply_single_year_uprating`, and end-to-end `extend_single_year_dataset`. Uses mock system objects to avoid loading the full tax-benefit system (~0.3s total runtime). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## PR Review

🔴 Critical (Must Fix)

1. If both …
2. HDFStore (PyTables) files accessed via:

   ```python
   with pd.HDFStore(file_path, mode="r") as store:
       return bool(entity_names & {k.strip("/") for k in store.keys()})
   ```

3. No handling of … The dual-path detection handles …

🟡 Should Address

4. …
5. …
6. Test mocking strategy is fragile
7. No tests for file I/O paths
8. …

🟢 Suggestions
## Validation Summary

Recommendation: address the 🔴 Critical (Must Fix) items above.
Fixes #7699
## Why this is needed
The API v2 alpha and the `policyengine` Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through `sim.calculate()` via `create_datasets()` — a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: `policyengine-uk-data` publishes entity-level HDFStore files directly, and `policyengine-uk` has `extend_single_year_dataset()`, which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames — no simulation engine involved. This PR brings the same capability to the US.
## How it works

### Dataset schema classes (`dataset_schema.py`)

`USSingleYearDataset` holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a `time_period`. It can load from / save to Pandas HDFStore files, and provides `.copy()` for deep-copying all DataFrames.

`USMultiYearDataset` wraps a `dict[int, USSingleYearDataset]` keyed by year. Its `.load()` returns data in `{variable: {year: array}}` format (`time_period_arrays`), which is what policyengine-core's `Microsimulation` expects for multi-year datasets.
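The shape of these classes can be sketched as follows. This is a hypothetical, simplified stand-in (the names `SingleYearSketch` and `MultiYearSketch`, and the in-memory `tables` dict, are illustrative; the real classes also handle HDFStore persistence):

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class SingleYearSketch:
    """Toy stand-in for USSingleYearDataset: one DataFrame per entity plus a time_period."""

    tables: dict[str, pd.DataFrame]
    time_period: int

    def copy(self) -> "SingleYearSketch":
        # Deep-copy every entity table so later mutation does not leak back.
        return SingleYearSketch(
            {name: df.copy(deep=True) for name, df in self.tables.items()},
            self.time_period,
        )


class MultiYearSketch:
    """Toy stand-in for USMultiYearDataset: wraps dict[int, SingleYearSketch] keyed by year."""

    def __init__(self, years: dict[int, SingleYearSketch]):
        self.years = years

    def load(self) -> dict[str, dict[int, np.ndarray]]:
        # Flatten to {variable: {year: array}}, the time_period_arrays
        # format that policyengine-core's Microsimulation expects.
        out: dict[str, dict[int, np.ndarray]] = {}
        for year, dataset in self.years.items():
            for df in dataset.tables.values():
                for col in df.columns:
                    out.setdefault(col, {})[year] = df[col].to_numpy()
        return out
```

Note the deep copy in `copy()`: the extension loop below mutates each year's tables, so sharing column buffers between years would corrupt earlier years.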
### Uprating logic (`economic_assumptions.py`)

`extend_single_year_dataset(dataset, end_year=2035)` takes a single base-year dataset and produces a multi-year dataset by:

- copying the base data forward for each year from `base_year` through `end_year`;
- for each upratable column, looking up `system.variables[var].uprating` to get a dotted parameter path (e.g. `"calibration.gov.irs.soi.employment_income"`), resolving it against `system.parameters`, and computing `factor = param(current_year) / param(previous_year)`; the column values are then multiplied by that factor;
- carrying non-uprated columns (e.g. `age`, entity IDs) forward unchanged.

This is the same approach used by `policyengine-uk`. The uprating mapping is derived entirely from `system.variables` at runtime — the 62 variables with explicit `uprating = "..."` and the 108 variables assigned via `default_uprating.py` are all picked up automatically. No separate list to maintain.
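The per-year factor logic can be sketched like this. It is a minimal sketch with a hard-coded toy parameter table standing in for `system.parameters`; the names `uprate_one_year`, `extend`, `PARAMS`, and `UPRATING` are hypothetical, not the real API:

```python
import numpy as np
import pandas as pd

# Toy parameter tree: dotted path -> {year: value} (stand-in for system.parameters).
PARAMS = {
    "calibration.gov.irs.soi.employment_income": {2024: 1.00, 2025: 1.03, 2026: 1.07},
}
# Toy uprating map: column -> dotted parameter path; None means carry forward.
UPRATING = {"employment_income": "calibration.gov.irs.soi.employment_income", "age": None}


def uprate_one_year(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Scale each upratable column by param(year) / param(year - 1)."""
    out = df.copy(deep=True)
    for col in out.columns:
        path = UPRATING.get(col)
        if path is None:
            continue  # non-uprated/unknown columns pass through unchanged
        prev = PARAMS[path].get(year - 1)
        curr = PARAMS[path].get(year)
        if not prev or curr is None:
            continue  # guard: missing values and division by zero
        out[col] = out[col] * (curr / prev)
    return out


def extend(person: pd.DataFrame, base_year: int, end_year: int) -> dict[int, pd.DataFrame]:
    """Chain year-over-year factors: year N+1 is uprated from year N, not from base."""
    years = {base_year: person.copy(deep=True)}
    for year in range(base_year + 1, end_year + 1):
        years[year] = uprate_one_year(years[year - 1], year)
    return years
```

Because factors chain year-over-year, a parameter path of 1.00 → 1.03 → 1.07 yields a cumulative 7% uplift by year two, matching what a single base-to-2026 factor would give.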
### Dual-path loading (`system.py`)

`Microsimulation.__init__` now auto-detects the dataset format before calling `super().__init__()`:

- HDFStore format (entity names like `person`, `household` as top-level HDF5 keys): loads as `USSingleYearDataset`, extends via `extend_single_year_dataset()`, and passes the resulting `USMultiYearDataset` to policyengine-core.
- h5py format: falls through to the core `Microsimulation` code path, unchanged.

Format detection (`_is_hdfstore_format`) inspects the top-level HDF5 keys — entity names indicate HDFStore, variable names indicate h5py.
## How we verify correctness

### Unit tests (22 tests, ~0.3s)
The test suite in `tests/microsimulation/data/` uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

- `_resolve_parameter` (3 tests): valid dotted path, invalid path, partially valid path
- `_apply_single_year_uprating` (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
- `extend_single_year_dataset` (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, `time_period` correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)
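The mock-system strategy can be as small as nested `SimpleNamespace` objects standing in for the parameter tree. A sketch (the `resolve_parameter` walk is a hypothetical reconstruction of what `_resolve_parameter` does, not the actual implementation):

```python
from types import SimpleNamespace


def resolve_parameter(root, dotted_path: str):
    """Walk a dotted path like 'calibration.gov.cpi' through nested attributes.

    Returns None when any segment is missing, covering the invalid-path and
    partially-valid-path cases without raising.
    """
    node = root
    for part in dotted_path.split("."):
        node = getattr(node, part, None)
        if node is None:
            return None
    return node


# A mock parameter tree standing in for system.parameters, so tests never
# load the full tax-benefit system. Leaf nodes are callables of year, like
# real parameters.
mock_parameters = SimpleNamespace(
    calibration=SimpleNamespace(
        gov=SimpleNamespace(cpi=lambda year: {2024: 1.00, 2025: 1.02}[year])
    )
)
```

Because the walk degrades to `None` instead of raising, a single code path serves both the happy case and the two failure cases the tests exercise.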
### Roundtrip validation (policyengine-us-data PR #568)

A separate one-off validation script in `policyengine-us-data` reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.
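In spirit, the comparison is a variable-by-variable array equality check between the two layouts. A toy in-memory analogue (the real script reads the h5py file and the converted HDFStore from disk; `compare_formats` and its argument shapes are hypothetical):

```python
import numpy as np
import pandas as pd


def compare_formats(variable_arrays: dict, entity_tables: dict) -> list[str]:
    """Return the variables whose values differ between the variable-centric
    layout ({variable: array}) and the entity-level layout ({entity: DataFrame})."""
    # Index every column of every entity table by variable name.
    columns = {
        col: df[col].to_numpy()
        for df in entity_tables.values()
        for col in df.columns
    }
    return [
        var
        for var, arr in variable_arrays.items()
        if var not in columns or not np.array_equal(np.asarray(arr), columns[var])
    ]
```

An empty return list corresponds to the 183/183 pass reported above: every variable present in the h5py file is found in exactly one entity table with identical values.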
## Depends on

## Test plan
- `make test-other` passes (runs the 22 unit tests via pytest)
- `Microsimulation(dataset="path/to/STATE.hdfstore.h5")` — verify it loads and extends correctly
- `Microsimulation(dataset="path/to/STATE.h5")` — verify the existing path still works
- Uprated variables (e.g. `employment_income`) grow year-over-year
- Non-uprated variables (e.g. `age`) are carried forward unchanged

🤖 Generated with Claude Code