
Blend entity values on would_file draws; fix entity weights#611

Open
baogorek wants to merge 30 commits into main from
fix-would-file-blend-and-entity-weights

Conversation


@baogorek baogorek commented Mar 16, 2026

Summary

This PR hardens the Modal pipeline for production use, covering three major areas:

1. Pre-baked Modal images (eliminate runtime installs)

  • Before: Every function call (15+ sites across 4 files) did a fresh git clone + uv sync, downloading 858MB of PyTorch/CUDA dependencies each time. ~15% failure rate from network timeouts.
  • After: Source code and dependencies are baked into Modal image layers at build time via add_local_dir(copy=True). Modal caches layers by content hash — unchanged code skips the build entirely.
  • New modal_app/images.py with shared cpu_image and gpu_image
  • New modal_app/resilience.py with subprocess retry wrapper
  • All 4 Modal apps (data_build, remote_calibration_runner, local_area, pipeline) updated to use pre-baked images
  • Fixes Python 3.11→3.13 mismatch in remote_calibration_runner
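The retry wrapper in `modal_app/resilience.py` is not shown in this PR description; as a rough sketch of the idea (names, signature, and backoff schedule here are illustrative, not the actual implementation), a subprocess retry with exponential backoff looks like:

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, base_delay=1.0):
    """Run a subprocess, retrying with exponential backoff on failure.

    Transient network errors (the ~15% failure rate mentioned above)
    surface as nonzero exit codes; retry a few times before giving up.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt == attempts:
            raise RuntimeError(
                f"{cmd!r} failed after {attempts} attempts: {result.stderr.strip()}"
            )
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

With pre-baked images this wrapper becomes a last line of defense rather than the main mitigation, since the flaky downloads no longer happen at runtime.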

2. End-to-end pipeline orchestrator (pipeline.py)

  • 5-step orchestrated pipeline: build datasets → build calibration package → fit weights (regional + national in parallel) → build H5s + stage → finalize
  • Resume support via run metadata (survives preemption)
  • Parallel regional + national calibration fitting
  • Validation diagnostics aggregation
  • Atomic promotion workflow (staging → production)
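The resume mechanism can be sketched as follows (a minimal illustration of resume-via-run-metadata; the step names, metadata path, and JSON schema are hypothetical, not the pipeline's actual layout):

```python
import json
from pathlib import Path

# Hypothetical metadata file recording which steps finished.
META = Path("run_metadata.json")

def load_completed():
    if META.exists():
        return set(json.loads(META.read_text()).get("completed", []))
    return set()

def mark_completed(step, completed):
    completed.add(step)
    META.write_text(json.dumps({"completed": sorted(completed)}))

def run_pipeline(steps):
    """Skip steps already recorded as complete, so a preempted run resumes
    from where it left off instead of starting over."""
    completed = load_completed()
    for name, fn in steps:
        if name in completed:
            continue
        fn()
        mark_completed(name, completed)
```

Because each step is marked complete only after it returns, a preemption mid-step reruns that step on resume rather than skipping it.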

3. Auto-trigger on merge to main

  • New .github/workflows/pipeline.yaml triggers on push to main
  • Uses modal run --detach (fire and forget — GH Actions runner exits, Modal runs independently)
  • Supports workflow_dispatch with configurable GPU/epochs/workers/skip_national

Supporting fixes (earlier commits)

  • Blend entity values on would_file draws; fix entity weights
  • Salt takeup draws with hh_id:clone_idx instead of block:hh_id
  • Fix county precomputation crashes (LA County zip_code, geography.npz removal)
  • Deduplicate county precomputation; enable aca_ptc/eitc/ctc targets
  • Configure distinct national vs regional calibration hyperparameters
  • Fix national H5 build: artifact validation remap and geography/weights mismatch

Deployment flow on merge

  1. Step 1 builds all datasets and uploads enhanced_cps_2024.h5 + cps_2024.h5 directly to HF + GCS (no promote gate)
  2. Steps 2-3 build calibration package and fit regional + national weights
  3. Step 4 builds local area H5s (states/districts/cities) + national US.h5, stages to HF staging/
  4. Promote (manual one-liner) pushes staged H5s to HF production + GCS

Test plan

  • Lint passes (ruff format)
  • Smoke test passes
  • uv.lock freshness check passes
  • Modal image builds successfully with pre-baked code (CI in progress)
  • Full pipeline smoke test: modal run modal_app/data_build.py::main --no-upload --skip-tests --skip-enhanced-cps

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.
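The property that makes the `.npz` redundant is seed-determinism. An illustrative stand-in (the real `assign_random_geography` returns richer assignments; only the reproducibility property is what matters here):

```python
import numpy as np

# Illustrative stand-in for assign_random_geography: the real function's
# outputs differ, but the property relied on is the same -- a fixed
# (n_records, n_clones, seed) always reproduces identical assignments.
def assign_random_geography(n_records, n_clones, seed):
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_clones, size=n_records)

a = assign_random_geography(1000, 430, seed=42)
b = assign_random_geography(1000, 430, seed=42)
assert np.array_equal(a, b)  # regenerating replaces loading a saved .npz
```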

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
baogorek and others added 27 commits March 20, 2026 11:32
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
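The per-tax-unit blend described above amounts to a masked selection between the two precomputed passes. A hedged sketch (`values_true`/`values_false` stand in for the all-True and would_file=False entity-value passes; names are illustrative):

```python
import numpy as np

def blend_entity_values(values_true, values_false, would_file):
    """Per tax unit, take the would_file=True value where the draw says
    the unit files, otherwise the would_file=False value."""
    return np.where(would_file, values_true, values_false)

values_true = np.array([100.0, 250.0, 0.0])
values_false = np.array([0.0, 0.0, 0.0])
would_file = np.array([True, False, True])
blended = blend_entity_values(values_true, values_false, would_file)
# blended == [100.0, 0.0, 0.0]
```

Applying target takeup draws to the blended values is what keeps X @ w consistent with sim.calculate.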
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
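The salting scheme can be sketched with a hash-based draw (a hypothetical illustration; the actual RNG construction in the codebase may differ):

```python
import hashlib

def takeup_draw(hh_id, clone_idx, salt="takeup"):
    """Deterministic uniform draw in [0, 1) keyed by (hh_id, clone_idx).

    The draw depends only on the donor household and clone index, so
    clones of the same household landing in the same block no longer
    collide; geography enters via the rate threshold, not the draw.
    """
    key = f"{salt}:{hh_id}:{clone_idx}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# Same (hh_id, clone_idx) -> same draw; different clones -> independent draws.
assert takeup_draw(17, 0) == takeup_draw(17, 0)
assert takeup_draw(17, 0) != takeup_draw(17, 1)
```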
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
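A minimal sketch of the failure mode and the workaround (variable names follow the commit message, not the actual policyengine-us code):

```python
# 'UNKNOWN' cannot be cast to int, so the three-digit-zip derivation
# crashes for LA County records lacking a real zip code.
zip_codes = ["90210", "UNKNOWN", "90001"]

def three_digit_zip(z):
    return int(z) // 100  # raises ValueError on 'UNKNOWN'

# Workaround: pin a valid in-county zip before county precomputation runs.
LA_COUNTY_FALLBACK = "90001"
safe = [z if z.isdigit() else LA_COUNTY_FALLBACK for z in zip_codes]
prefixes = [three_digit_zip(z) for z in safe]
# prefixes == [902, 900, 900]
```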
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: was saving unique CD count instead of
  actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
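The input-scoped checkpoint idea can be sketched like this (a hypothetical illustration; the actual hashing and directory scheme in publish_local_area.py may differ):

```python
import hashlib
from pathlib import Path

def inputs_fingerprint(*paths):
    """Hash the bytes of each checkpoint input (weights, dataset)."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()[:16]

def checkpoint_dir(base, *inputs):
    """Scope checkpoints to their inputs so stale ones self-invalidate."""
    fp = inputs_fingerprint(*inputs)
    d = Path(base) / fp
    # A changed input yields a new fingerprint and a fresh directory,
    # so stale checkpoints are never silently reused.
    d.mkdir(parents=True, exist_ok=True)
    return d
```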
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s mismatch

1. validate_artifacts now accepts filename_remap so the national config
   (which records calibration_weights.npy) checks national_calibration_weights.npy
2. Worker regenerates geography when national weights have fewer clones
   than the regional geography

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orts; build enhanced CPS

- Regional: epochs=1000, beta=0.65, L0=1e-7, L2=1e-8
- National: epochs=4000, beta=0.65, L0=1e-4, L2=1e-12
- Both use target_config.yaml (same targets, different regularization)
- Fix pipeline.py ModuleNotFoundError by adding sys.path setup
- Default GPU to T4 everywhere
- Re-enable enhanced_cps build and upload in pipeline step 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-container git clone + uv sync (858MB PyTorch/CUDA each time)
with add_local_dir(copy=True) images that bake source code and deps at
build time. Modal caches layers by content hash, so unchanged code skips
the build entirely.

- Add modal_app/images.py with shared cpu_image and gpu_image
- Add modal_app/resilience.py with subprocess retry wrapper
- Add .github/workflows/pipeline.yaml for auto-trigger on merge to main
- Simplify all 4 Modal apps to use pre-baked images (no runtime cloning)
- Fix Python 3.11→3.13 mismatch in remote_calibration_runner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from ffca0a7 to 5f52409 Compare March 20, 2026 15:36
baogorek and others added 2 commits March 20, 2026 11:42
The uv.lock uses revision=3 format which requires uv 0.8+.
Without pinning, pip may install an older uv that can't parse it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--locked checks pyproject.toml ↔ uv.lock consistency, which fails
due to uv version differences between local and container. --frozen
installs exactly what's in the lockfile without the consistency
check, which is correct for a baked image where the lockfile is
authoritative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>