Disaggregate PUF aggregate records and fix QRF high-income training#627
Conversation
The IRS PUF has 4 aggregate records (MARS=0, RECID 999996-999999) bundling ~1,214 ultra-high-income filers. These were previously dropped entirely, losing $140B+ in AGI from the pipeline. This PR:

1. Disaggregates the 4 records into ~120 weighted synthetic records using truncated lognormal AGI (capped at $1.25B for $100M+), Dirichlet composition shares, and donor-based secondary variables. Records have variable weights (5-20) rather than unit weights.
2. Fixes the QRF training sample in the extended CPS to preserve high-income records. The old weighted `subsample(10_000)` dropped nearly all $5M+ AGI records (weight≈1). Now uses stratified sampling: keep up to 5,000 high-income records + 2,000 regular records.
3. Updates pension imputation to use CPS_2024 (was CPS_2021, which had a stale `auto_loan_balance` array).
4. Casts `exemptions_count` to int for synthetic records with float values from donor scaling.

Closes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
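The Dirichlet composition-share step in item 1 can be illustrated with a short sketch. This is a hypothetical example, not the PR's code: the variable names, record count, and concentration parameter are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: split one aggregate record's $12B wage total across
# 30 synthetic records using Dirichlet-distributed shares. A
# concentration parameter below 1 produces unequal splits, which
# suits highly skewed top-tail income data.
n_records = 30
aggregate_wages = 12e9
shares = rng.dirichlet(np.full(n_records, 0.5))
synthetic_wages = shares * aggregate_wages

# Shares sum to 1, so the aggregate total is preserved exactly.
assert np.isclose(synthetic_wages.sum(), aggregate_wages)
```

Because the shares always sum to 1, any aggregate total split this way is conserved regardless of how unequal the split is.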
Add 36 new calibration targets from IRS SOI Table 4.3 (TY 2021) to improve the reweighting signal for the extreme top of the income distribution.

Targets cover 4 AGI percentile intervals:

- Top 0.001% (AGI >= ~$118M)
- 0.001%-0.01% (AGI ~$23M-$118M)
- 0.01%-0.1% (AGI ~$3.8M-$23M)
- 0.1%-1% (AGI ~$683k-$3.8M)

Each interval has 9 targets: count, AGI, wages, taxable interest, ordinary dividends, qualified dividends, capital gains, business net profits, and partnership/S-corp income.

These targets are automatically picked up by the existing `build_loss_matrix()` and `get_soi()` infrastructure since all variables are already in `agi_level_targeted_variables`.

Fixes #626

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
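The mechanics of an interval target can be sketched as follows. This is an illustrative toy, not the repo's `build_loss_matrix()` code; the arrays and bounds are made-up numbers.

```python
import numpy as np

# Illustrative sketch of how one Table 4.3 target enters calibration:
# the estimate for "total AGI in the 0.01%-0.1% interval" is a
# weighted sum over records whose AGI falls inside the interval.
agi = np.array([1e6, 5e6, 10e6, 30e6])      # per-record AGI
weights = np.array([20.0, 8.0, 5.0, 5.0])   # calibration weights
lo, hi = 3.8e6, 23e6                        # interval bounds (~$3.8M-$23M)
in_interval = (agi >= lo) & (agi < hi)
estimate = (weights * agi * in_interval).sum()
# The reweighting loss then penalizes deviation from the published
# SOI total, e.g. via ((estimate - target) / target) ** 2.
```

Only the middle two records land in the interval here, so the estimate is 8×$5M + 5×$10M = $90M; reweighting adjusts `weights` until such estimates match the SOI totals.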
- Remove unused variables (log_lower, log_upper, pop_returns)
- Update CLAUDE.md: black → ruff; line length now references pyproject.toml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main already implements stratified PUF subsampling in calibration/puf_impute.py, making our extended_cps.py changes redundant. Take upstream's refactored extended_cps.py as-is. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmal

Replaces the O(n*1000) rejection-sampling loop with vectorized `scipy.stats.truncnorm` in log-space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
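The log-space trick can be sketched as below: a lognormal variable truncated to dollar bounds is just a truncated normal in log-space, exponentiated. This is a minimal sketch under assumed parameter names, not the PR's exact function.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_lognormal(n, mu, sigma, lower, upper, rng):
    """Draw n lognormal samples truncated to [lower, upper] without
    rejection sampling, by drawing a truncated normal in log-space.

    mu and sigma parameterize the underlying normal; lower/upper
    bound the dollar-scale lognormal variable.
    """
    # truncnorm takes bounds in standard units relative to loc/scale.
    a = (np.log(lower) - mu) / sigma
    b = (np.log(upper) - mu) / sigma
    z = truncnorm.rvs(a, b, loc=mu, scale=sigma, size=n,
                      random_state=rng)
    return np.exp(z)

rng = np.random.default_rng(42)
# Illustrative parameters: AGI centered near $200M, capped at $1.25B.
agi = sample_truncated_lognormal(1_000, mu=np.log(2e8), sigma=0.8,
                                 lower=1e8, upper=1.25e9, rng=rng)
assert agi.min() >= 1e8 and agi.max() <= 1.25e9
```

Unlike rejection sampling, every draw is accepted, so the cost is O(n) and independent of how tight the truncation bounds are.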
- Replace np.random.binomial with rng.binomial in two places (sample_bernoulli_lognormal and the SSTB simulation) to use the seeded generator instead of global numpy state
- Remove the internal "v2" label from the test docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
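The reproducibility point above is easy to demonstrate in isolation. A minimal sketch (illustrative values, not the repo's code):

```python
import numpy as np

# Two Generators created from the same seed produce identical
# Bernoulli draws; module-level np.random.binomial would instead
# consume shared global state, so results depend on every earlier
# call anywhere in the process.
draws_a = np.random.default_rng(123).binomial(n=1, p=0.3, size=8)
draws_b = np.random.default_rng(123).binomial(n=1, p=0.3, size=8)
assert (draws_a == draws_b).all()
```

Passing an explicit `rng` through the pipeline also lets tests pin a seed without touching global state.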
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
During parallel CI builds, CPS_2024 may not exist yet when puf.py runs. Fall back to the CPS_2021 release artifact, which is always available. The pension QRF only needs employment_income and pre_tax_contributions as training data, so the specific CPS year doesn't materially affect the imputation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
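The fallback described above might look like the sketch below. The loader, file names, and return type are all hypothetical stand-ins; the point is catching a specific exception (here `FileNotFoundError`) rather than a broad `except Exception`, so unrelated failures aren't masked.

```python
from pathlib import Path


def load_cps(year: int, root: Path = Path("data")) -> bytes:
    # Hypothetical stand-in loader; the real pipeline loads a
    # dataset artifact, not raw bytes.
    return (root / f"cps_{year}.h5").read_bytes()


def load_training_cps(root: Path = Path("data")) -> bytes:
    """Prefer CPS_2024; fall back to the always-available CPS_2021
    release artifact when a parallel CI build hasn't produced
    CPS_2024 yet."""
    try:
        return load_cps(2024, root)
    except FileNotFoundError:
        # Parallel CI build: CPS_2024 not written yet.
        return load_cps(2021, root)
```

If the loader can fail in other ways (corrupt file, schema mismatch), those errors propagate instead of silently triggering the fallback.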
Suggested way to salvage/split this work:

PR A: keep the safe, mergeable pieces

Scope:

Goal:

PR B: replace the aggregate-record method entirely

Scope:

Why split it this way:

Forbes

I think Forbes should stay separate from both PR A and PR B.

If this split makes sense, I'd suggest narrowing #627 to PR A and opening PR B as the aggregate-record replacement.
At 250 epochs the optimizer is severely undertrained, producing 7.6% mean error. At 500 epochs with the same data: 2.0% mean error, with 98% of targets within 10%. The $5M-$10M AGI bracket improves from 36% to 8% error, and $10M+ from 23% to 14%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calibration results (500 epochs)

Ran the full pipeline locally with all changes (disaggregation + Table 4.3 targets + 500 epochs). Compared against the baseline (no disaggregation, no Table 4.3 targets, 250 epochs, stratified QRF from main).

Overall
High-income brackets
New Table 4.3 targets (not in baseline)
Key takeaway

The 250-epoch baseline was severely undertrained. At 500 epochs, the combination of PUF disaggregation + Table 4.3 calibration targets produces dramatically better results across the board. The $5M+ brackets, previously the worst, are now under 14% error, and we have visibility into the $118M+ bracket for the first time.
Calibration vs old prod dataset

Comparing against the previous production dataset (490 epochs, different extended CPS):

Overall
By category
Key high-income brackets
New Table 4.3 targets (not in old prod)
This is a genuine improvement across the board vs old prod, not just recovery from the 250→500 epoch fix. State-level targets: 807 improved vs 336 worsened; Medicaid/program targets: 60 improved vs 15 worsened. Plus 53 new top-tail targets with 5.4% mean error that didn't exist before.
baogorek left a comment
Plan

- Let #627 merge to main
- Rebase our branch on top, resolving the 3 modal_app conflicts by keeping our pre-baked image approach
- The PUF/calibration changes are orthogonal to ours and won't interact
Feedback:
Minor issues (not blocking)
- disaggregate_puf.py re-exports 16 private (_-prefixed) names — messy but harmless
- enhanced_cps.py epochs 250→500 doubles calibration runtime — intentional?
- Broad except Exception in CPS_2024 fallback
Summary
Context
The IRS PUF has 4 aggregate records bundling ~1,214 ultra-high-income filers ($140B+ in AGI). These were previously dropped (`puf = puf[puf.MARS != 0]`), losing all representation of $100M+ filers.

Disaggregation approach (v2, conservative)
QRF stratified training
The old `puf_sim.subsample(10_000)` used weighted sampling that dropped nearly all weight≈1 high-income records ($5M+ AGI: 15,528 in the full PUF → 1 after subsampling). Now uses stratified sampling: keep up to 5,000 high-income records + 2,000 regular records.

Calibration impact
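The stratified training sample described above can be sketched as follows. Column names, cutoff, and the pandas-based shape are assumptions for illustration, not the PR's exact code.

```python
import numpy as np
import pandas as pd


def stratified_subsample(df, rng, agi_col="agi", cutoff=5e6,
                         n_high=5_000, n_regular=2_000):
    """Keep up to n_high records at or above the AGI cutoff plus a
    random n_regular draw from the rest, so rare weight≈1 top-tail
    records survive instead of being lost to weighted sampling."""
    high = df[df[agi_col] >= cutoff]
    regular = df[df[agi_col] < cutoff]
    if len(high) > n_high:
        high = high.sample(n=n_high, random_state=rng)
    regular = regular.sample(n=min(n_regular, len(regular)),
                             random_state=rng)
    return pd.concat([high, regular])
```

Because the high-income stratum is kept (nearly) in full rather than sampled by weight, the QRF sees the top tail even though those records carry tiny weights.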
A/B comparison (both with stratified QRF, 190 epochs):
The disaggregation improves the $2M-$5M bracket dramatically (10.3% → 0.4%) but worsens $5M+ brackets. This is because:
The main lever for improving $5M+ calibration is adding finer calibration targets from SOI Table 4.3 (#626), which provides income breakdowns for the top 0.001% ($85M+), 0.01% ($18M+), and 0.1% (~$3.3M+).

Note: The previous baseline (2.4% mean error) used 490 epochs vs our 190, and a different extended CPS build. The fair A/B above isolates the disaggregation effect.
Test plan
Follow-ups
🤖 Generated with Claude Code