
Blend entity values on would_file draws; fix entity weights#611

Open
baogorek wants to merge 30 commits into main from
fix-would-file-blend-and-entity-weights

Conversation


@baogorek baogorek commented Mar 16, 2026

Summary

This PR hardens the Modal pipeline for production use, covering three major areas:

1. Pre-baked Modal images (eliminate runtime installs)

  • Before: Every function call (15+ sites across 4 files) did a fresh git clone + uv sync, downloading 858MB of PyTorch/CUDA dependencies each time. ~15% failure rate from network timeouts.
  • After: Source code and dependencies are baked into Modal image layers at build time via add_local_dir(copy=True). Modal caches layers by content hash — unchanged code skips the build entirely.
  • New modal_app/images.py with shared cpu_image and gpu_image
  • New modal_app/resilience.py with subprocess retry wrapper
  • All 4 Modal apps (data_build, remote_calibration_runner, local_area, pipeline) updated to use pre-baked images
  • Fixes Python 3.11→3.13 mismatch in remote_calibration_runner
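The retry wrapper in `modal_app/resilience.py` is not shown in this PR description; as a rough sketch of the idea (names, signature, and backoff schedule here are illustrative, not the actual implementation), a subprocess retry with exponential backoff looks like:

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, base_delay=1.0):
    """Run a subprocess, retrying with exponential backoff on failure.

    Transient network errors (the ~15% failure rate mentioned above)
    surface as nonzero exit codes; retry a few times before giving up.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt == attempts:
            raise RuntimeError(
                f"{cmd!r} failed after {attempts} attempts: {result.stderr.strip()}"
            )
        time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

With pre-baked images this wrapper becomes a last line of defense rather than the main mitigation, since the flaky downloads no longer happen at runtime.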

2. End-to-end pipeline orchestrator (pipeline.py)

  • 5-step orchestrated pipeline: build datasets → build calibration package → fit weights (regional + national in parallel) → build H5s + stage → finalize
  • Resume support via run metadata (survives preemption)
  • Parallel regional + national calibration fitting
  • Validation diagnostics aggregation
  • Atomic promotion workflow (staging → production)
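The resume mechanism can be sketched as follows (a minimal illustration of resume-via-run-metadata; the step names, metadata path, and JSON schema are hypothetical, not the pipeline's actual layout):

```python
import json
from pathlib import Path

# Hypothetical metadata file recording which steps finished.
META = Path("run_metadata.json")

def load_completed():
    if META.exists():
        return set(json.loads(META.read_text()).get("completed", []))
    return set()

def mark_completed(step, completed):
    completed.add(step)
    META.write_text(json.dumps({"completed": sorted(completed)}))

def run_pipeline(steps):
    """Skip steps already recorded as complete, so a preempted run resumes
    from where it left off instead of starting over."""
    completed = load_completed()
    for name, fn in steps:
        if name in completed:
            continue
        fn()
        mark_completed(name, completed)
```

Because each step is marked complete only after it returns, a preemption mid-step reruns that step on resume rather than skipping it.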

3. Auto-trigger on merge to main

  • New .github/workflows/pipeline.yaml triggers on push to main
  • Uses modal run --detach (fire and forget — GH Actions runner exits, Modal runs independently)
  • Supports workflow_dispatch with configurable GPU/epochs/workers/skip_national

Supporting fixes (earlier commits)

  • Blend entity values on would_file draws; fix entity weights
  • Salt takeup draws with hh_id:clone_idx instead of block:hh_id
  • Fix county precomputation crashes (LA County zip_code, geography.npz removal)
  • Deduplicate county precomputation; enable aca_ptc/eitc/ctc targets
  • Configure distinct national vs regional calibration hyperparameters
  • Fix national H5 build: artifact validation remap and geography/weights mismatch

Deployment flow on merge

  1. Step 1 builds all datasets and uploads enhanced_cps_2024.h5 + cps_2024.h5 directly to HF + GCS (no promote gate)
  2. Steps 2-3 build calibration package and fit regional + national weights
  3. Step 4 builds local area H5s (states/districts/cities) + national US.h5, stages to HF staging/
  4. Promote (manual one-liner) pushes staged H5s to HF production + GCS

Test plan

  • Lint passes (ruff format)
  • Smoke test passes
  • uv.lock freshness check passes
  • Modal image builds successfully with pre-baked code (CI in progress)
  • Full pipeline smoke test: modal run modal_app/data_build.py::main --no-upload --skip-tests --skip-enhanced-cps

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.
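The property that makes the `.npz` redundant is seed-determinism. An illustrative stand-in (the real `assign_random_geography` returns richer assignments; only the reproducibility property is what matters here):

```python
import numpy as np

# Illustrative stand-in for assign_random_geography: the real function's
# outputs differ, but the property relied on is the same -- a fixed
# (n_records, n_clones, seed) always reproduces identical assignments.
def assign_random_geography(n_records, n_clones, seed):
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_clones, size=n_records)

a = assign_random_geography(1000, 430, seed=42)
b = assign_random_geography(1000, 430, seed=42)
assert np.array_equal(a, b)  # regenerating replaces loading a saved .npz
```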

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
baogorek and others added 27 commits March 20, 2026 11:32
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
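The per-tax-unit blend described above amounts to a masked selection between the two precomputed passes. A hedged sketch (`values_true`/`values_false` stand in for the all-True and would_file=False entity-value passes; names are illustrative):

```python
import numpy as np

def blend_entity_values(values_true, values_false, would_file):
    """Per tax unit, take the would_file=True value where the draw says
    the unit files, otherwise the would_file=False value."""
    return np.where(would_file, values_true, values_false)

values_true = np.array([100.0, 250.0, 0.0])
values_false = np.array([0.0, 0.0, 0.0])
would_file = np.array([True, False, True])
blended = blend_entity_values(values_true, values_false, would_file)
# blended == [100.0, 0.0, 0.0]
```

Applying target takeup draws to the blended values is what keeps X @ w consistent with sim.calculate.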
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
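The salting scheme can be sketched with a hash-based draw (a hypothetical illustration; the actual RNG construction in the codebase may differ):

```python
import hashlib

def takeup_draw(hh_id, clone_idx, salt="takeup"):
    """Deterministic uniform draw in [0, 1) keyed by (hh_id, clone_idx).

    The draw depends only on the donor household and clone index, so
    clones of the same household landing in the same block no longer
    collide; geography enters via the rate threshold, not the draw.
    """
    key = f"{salt}:{hh_id}:{clone_idx}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# Same (hh_id, clone_idx) -> same draw; different clones -> independent draws.
assert takeup_draw(17, 0) == takeup_draw(17, 0)
assert takeup_draw(17, 0) != takeup_draw(17, 1)
```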
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
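A minimal sketch of the failure mode and the workaround (variable names follow the commit message, not the actual policyengine-us code):

```python
# 'UNKNOWN' cannot be cast to int, so the three-digit-zip derivation
# crashes for LA County records lacking a real zip code.
zip_codes = ["90210", "UNKNOWN", "90001"]

def three_digit_zip(z):
    return int(z) // 100  # raises ValueError on 'UNKNOWN'

# Workaround: pin a valid in-county zip before county precomputation runs.
LA_COUNTY_FALLBACK = "90001"
safe = [z if z.isdigit() else LA_COUNTY_FALLBACK for z in zip_codes]
prefixes = [three_digit_zip(z) for z in safe]
# prefixes == [902, 900, 900]
```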
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: was saving unique CD count instead of
  actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
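The input-scoped checkpoint idea can be sketched like this (a hypothetical illustration; the actual hashing and directory scheme in publish_local_area.py may differ):

```python
import hashlib
from pathlib import Path

def inputs_fingerprint(*paths):
    """Hash the bytes of each checkpoint input (weights, dataset)."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()[:16]

def checkpoint_dir(base, *inputs):
    """Scope checkpoints to their inputs so stale ones self-invalidate."""
    fp = inputs_fingerprint(*inputs)
    d = Path(base) / fp
    # A changed input yields a new fingerprint and a fresh directory,
    # so stale checkpoints are never silently reused.
    d.mkdir(parents=True, exist_ok=True)
    return d
```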
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s mismatch

1. validate_artifacts now accepts filename_remap so the national config
   (which records calibration_weights.npy) checks national_calibration_weights.npy
2. Worker regenerates geography when national weights have fewer clones
   than the regional geography

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orts; build enhanced CPS

- Regional: epochs=1000, beta=0.65, L0=1e-7, L2=1e-8
- National: epochs=4000, beta=0.65, L0=1e-4, L2=1e-12
- Both use target_config.yaml (same targets, different regularization)
- Fix pipeline.py ModuleNotFoundError by adding sys.path setup
- Default GPU to T4 everywhere
- Re-enable enhanced_cps build and upload in pipeline step 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-container git clone + uv sync (858MB PyTorch/CUDA each time)
with add_local_dir(copy=True) images that bake source code and deps at
build time. Modal caches layers by content hash, so unchanged code skips
the build entirely.

- Add modal_app/images.py with shared cpu_image and gpu_image
- Add modal_app/resilience.py with subprocess retry wrapper
- Add .github/workflows/pipeline.yaml for auto-trigger on merge to main
- Simplify all 4 Modal apps to use pre-baked images (no runtime cloning)
- Fix Python 3.11→3.13 mismatch in remote_calibration_runner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from ffca0a7 to 5f52409 Compare March 20, 2026 15:36
baogorek and others added 2 commits March 20, 2026 11:42
The uv.lock uses revision=3 format which requires uv 0.8+.
Without pinning, pip may install an older uv that can't parse it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--locked checks pyproject.toml ↔ uv.lock consistency, which fails
due to uv version differences between local and container. --frozen
installs exactly what's in the lockfile without the consistency
check, which is correct for a baked image where the lockfile is
authoritative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>