Disaggregate PUF aggregate records and fix QRF high-income training#627
Conversation
The IRS PUF has 4 aggregate records (MARS=0, RECID 999996-999999) bundling ~1,214 ultra-high-income filers. These were previously dropped entirely, losing $140B+ in AGI from the pipeline. This PR:

1. Disaggregates the 4 records into ~120 weighted synthetic records using truncated lognormal AGI (capped at $1.25B for $100M+), Dirichlet composition shares, and donor-based secondary variables. Records have variable weights (5-20) rather than unit weights.
2. Fixes the QRF training sample in the extended CPS to preserve high-income records. The old weighted `subsample(10_000)` dropped nearly all $5M+ AGI records (weight≈1). Now uses stratified sampling: keep up to 5,000 high-income records + 2,000 regular records.
3. Updates pension imputation to use CPS_2024 (was CPS_2021, which had a stale `auto_loan_balance` array).
4. Casts `exemptions_count` to int for synthetic records with float values from donor scaling.

Closes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
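The Dirichlet composition-share step in item 1 can be illustrated with a short sketch. This is a hypothetical example, not the PR's code: the variable names, record count, and concentration parameter are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: split one aggregate record's $12B wage total across
# 30 synthetic records using Dirichlet-distributed shares. A
# concentration parameter below 1 produces unequal splits, which
# suits highly skewed top-tail income data.
n_records = 30
aggregate_wages = 12e9
shares = rng.dirichlet(np.full(n_records, 0.5))
synthetic_wages = shares * aggregate_wages

# Shares sum to 1, so the aggregate total is preserved exactly.
assert np.isclose(synthetic_wages.sum(), aggregate_wages)
```

Because the shares always sum to 1, any aggregate total split this way is conserved regardless of how unequal the split is.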
Add 36 new calibration targets from IRS SOI Table 4.3 (TY 2021) to improve the reweighting signal for the extreme top of the income distribution.

Targets cover 4 AGI percentile intervals:

- Top 0.001% (AGI >= ~$118M)
- 0.001%-0.01% (AGI ~$23M-$118M)
- 0.01%-0.1% (AGI ~$3.8M-$23M)
- 0.1%-1% (AGI ~$683k-$3.8M)

Each interval has 9 targets: count, AGI, wages, taxable interest, ordinary dividends, qualified dividends, capital gains, business net profits, and partnership/S-corp income.

These targets are automatically picked up by the existing `build_loss_matrix()` and `get_soi()` infrastructure since all variables are already in `agi_level_targeted_variables`.

Fixes #626

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
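The mechanics of an interval target can be sketched as follows. This is an illustrative toy, not the repo's `build_loss_matrix()` code; the arrays and bounds are made-up numbers.

```python
import numpy as np

# Illustrative sketch of how one Table 4.3 target enters calibration:
# the estimate for "total AGI in the 0.01%-0.1% interval" is a
# weighted sum over records whose AGI falls inside the interval.
agi = np.array([1e6, 5e6, 10e6, 30e6])      # per-record AGI
weights = np.array([20.0, 8.0, 5.0, 5.0])   # calibration weights
lo, hi = 3.8e6, 23e6                        # interval bounds (~$3.8M-$23M)
in_interval = (agi >= lo) & (agi < hi)
estimate = (weights * agi * in_interval).sum()
# The reweighting loss then penalizes deviation from the published
# SOI total, e.g. via ((estimate - target) / target) ** 2.
```

Only the middle two records land in the interval here, so the estimate is 8×$5M + 5×$10M = $90M; reweighting adjusts `weights` until such estimates match the SOI totals.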
- Remove unused variables (log_lower, log_upper, pop_returns)
- Update CLAUDE.md: black → ruff; line length now references pyproject.toml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main already implements stratified PUF subsampling in calibration/puf_impute.py, making our extended_cps.py changes redundant. Take upstream's refactored extended_cps.py as-is. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmal

Replaces the O(n*1000) rejection-sampling loop with vectorized `scipy.stats.truncnorm` in log-space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
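The log-space trick can be sketched as below: a lognormal variable truncated to dollar bounds is just a truncated normal in log-space, exponentiated. This is a minimal sketch under assumed parameter names, not the PR's exact function.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_lognormal(n, mu, sigma, lower, upper, rng):
    """Draw n lognormal samples truncated to [lower, upper] without
    rejection sampling, by drawing a truncated normal in log-space.

    mu and sigma parameterize the underlying normal; lower/upper
    bound the dollar-scale lognormal variable.
    """
    # truncnorm takes bounds in standard units relative to loc/scale.
    a = (np.log(lower) - mu) / sigma
    b = (np.log(upper) - mu) / sigma
    z = truncnorm.rvs(a, b, loc=mu, scale=sigma, size=n,
                      random_state=rng)
    return np.exp(z)

rng = np.random.default_rng(42)
# Illustrative parameters: AGI centered near $200M, capped at $1.25B.
agi = sample_truncated_lognormal(1_000, mu=np.log(2e8), sigma=0.8,
                                 lower=1e8, upper=1.25e9, rng=rng)
assert agi.min() >= 1e8 and agi.max() <= 1.25e9
```

Unlike rejection sampling, every draw is accepted, so the cost is O(n) and independent of how tight the truncation bounds are.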
- Replace np.random.binomial with rng.binomial in two places (sample_bernoulli_lognormal and the SSTB simulation) to use the seeded generator instead of global numpy state
- Remove the internal "v2" label from the test docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
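The reproducibility point above is easy to demonstrate in isolation. A minimal sketch (illustrative values, not the repo's code):

```python
import numpy as np

# Two Generators created from the same seed produce identical
# Bernoulli draws; module-level np.random.binomial would instead
# consume shared global state, so results depend on every earlier
# call anywhere in the process.
draws_a = np.random.default_rng(123).binomial(n=1, p=0.3, size=8)
draws_b = np.random.default_rng(123).binomial(n=1, p=0.3, size=8)
assert (draws_a == draws_b).all()
```

Passing an explicit `rng` through the pipeline also lets tests pin a seed without touching global state.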
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
During parallel CI builds, CPS_2024 may not exist yet when puf.py runs. Fall back to the CPS_2021 release artifact, which is always available. The pension QRF only needs employment_income and pre_tax_contributions as training data, so the specific CPS year doesn't materially affect the imputation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
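The fallback described above might look like the sketch below. The loader, file names, and return type are all hypothetical stand-ins; the point is catching a specific exception (here `FileNotFoundError`) rather than a broad `except Exception`, so unrelated failures aren't masked.

```python
from pathlib import Path


def load_cps(year: int, root: Path = Path("data")) -> bytes:
    # Hypothetical stand-in loader; the real pipeline loads a
    # dataset artifact, not raw bytes.
    return (root / f"cps_{year}.h5").read_bytes()


def load_training_cps(root: Path = Path("data")) -> bytes:
    """Prefer CPS_2024; fall back to the always-available CPS_2021
    release artifact when a parallel CI build hasn't produced
    CPS_2024 yet."""
    try:
        return load_cps(2024, root)
    except FileNotFoundError:
        # Parallel CI build: CPS_2024 not written yet.
        return load_cps(2021, root)
```

If the loader can fail in other ways (corrupt file, schema mismatch), those errors propagate instead of silently triggering the fallback.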
Suggested way to salvage/split this work:

PR A: keep the safe, mergeable pieces

Scope:

Goal:

PR B: replace the aggregate-record method entirely

Scope:

Why split it this way:

Forbes

I think Forbes should stay separate from both PR A and PR B.

If this split makes sense, I'd suggest narrowing #627 to PR A and opening PR B as the aggregate-record replacement.
At 250 epochs the optimizer is severely undertrained, producing 7.6% mean error. At 500 epochs with the same data: 2.0% mean error, with 98% of targets within 10%. The $5M-$10M AGI bracket improves from 36% to 8% error, and $10M+ from 23% to 14%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calibration results (500 epochs)

Ran the full pipeline locally with all changes (disaggregation + Table 4.3 targets + 500 epochs). Compared against the baseline (no disaggregation, no Table 4.3 targets, 250 epochs, stratified QRF from main).

Overall
High-income brackets
New Table 4.3 targets (not in baseline)
Key takeaway

The 250-epoch baseline was severely undertrained. At 500 epochs, the combination of PUF disaggregation + Table 4.3 calibration targets produces dramatically better results across the board. The $5M+ brackets, previously the worst, are now under 14% error, and we have visibility into the $118M+ bracket for the first time.
Calibration vs old prod dataset

Comparing against the previous production dataset (490 epochs, different extended CPS):

Overall
By category
Key high-income brackets
New Table 4.3 targets (not in old prod)
This is a genuine improvement across the board vs old prod, not just recovery from the 250→500 epoch fix. State-level targets: 807 improved vs 336 worsened; Medicaid/program targets: 60 improved vs 15 worsened. Plus 53 new top-tail targets with 5.4% mean error that didn't exist before.
baogorek left a comment
Plan

- Let #627 merge to main
- Rebase our branch on top, resolving the 3 modal_app conflicts by keeping our pre-baked image approach
- The PUF/calibration changes are orthogonal to ours and won't interact
Feedback:
Minor issues (not blocking)
- disaggregate_puf.py re-exports 16 private (_-prefixed) names — messy but harmless
- enhanced_cps.py epochs 250→500 doubles calibration runtime — intentional?
- Broad except Exception in CPS_2024 fallback
Summary
Context
The IRS PUF has 4 aggregate records bundling ~1,214 ultra-high-income filers ($140B+ in AGI). These were previously dropped (`puf = puf[puf.MARS != 0]`), losing all representation of $100M+ filers.

Disaggregation approach (v2, conservative)
QRF stratified training
The old `puf_sim.subsample(10_000)` used weighted sampling that dropped nearly all weight≈1 high-income records ($5M+ AGI: 15,528 in the full PUF → 1 after subsampling). Now uses stratified sampling: keep up to 5,000 high-income records + 2,000 regular records.

Calibration impact
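The stratified training sample described above can be sketched as follows. Column names, cutoff, and the pandas-based shape are assumptions for illustration, not the PR's exact code.

```python
import numpy as np
import pandas as pd


def stratified_subsample(df, rng, agi_col="agi", cutoff=5e6,
                         n_high=5_000, n_regular=2_000):
    """Keep up to n_high records at or above the AGI cutoff plus a
    random n_regular draw from the rest, so rare weight≈1 top-tail
    records survive instead of being lost to weighted sampling."""
    high = df[df[agi_col] >= cutoff]
    regular = df[df[agi_col] < cutoff]
    if len(high) > n_high:
        high = high.sample(n=n_high, random_state=rng)
    regular = regular.sample(n=min(n_regular, len(regular)),
                             random_state=rng)
    return pd.concat([high, regular])
```

Because the high-income stratum is kept (nearly) in full rather than sampled by weight, the QRF sees the top tail even though those records carry tiny weights.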
A/B comparison (both with stratified QRF, 190 epochs):
The disaggregation improves the $2M-$5M bracket dramatically (10.3% → 0.4%) but worsens $5M+ brackets. This is because:
The main lever for improving $5M+ calibration is adding finer calibration targets from SOI Table 4.3 (#626), which provides income breakdowns for the top 0.001% ($85M+), 0.01% ($18M+), and 0.1% (~$3.3M+).

Note: The previous baseline (2.4% mean error) used 490 epochs vs our 190, and a different extended CPS build. The fair A/B above isolates the disaggregation effect.
Test plan
Follow-ups
🤖 Generated with Claude Code