Add ihdp_covariates.csv for IHDP Dataset Simulator #14

Open
Gitanaskhan26 wants to merge 5 commits into pgmpy:main from Gitanaskhan26:main

Gitanaskhan26 commented Jul 1, 2026

Summary

Adds ihdp_covariates.csv (747 rows × 26 columns: treatment + x1-x25) to the pgmpy/example_datasets HuggingFace repo. This is the fixed, real-covariate design matrix that IHDPDataset loads via _get_raw_data("ihdp_covariates.csv") on every instantiation.

Related issue: pgmpy/pgmpy#3420

Provenance

  • Source: ihdp_npci_1.csv from the CEVAE repository (github.com/AMLab-Amsterdam/CEVAE/tree/master/datasets/IHDP/csv), one of the standard NPCI-generated (Dorie, 2016) IHDP replications built on real covariates from the Infant Health and Development Program (Hill, 2011).
  • Only treatment and x1-x25 are kept. y_factual, y_cfactual, mu0, mu1 are dropped — those are outcome columns specific to that one CEVAE replication; IHDPDataset regenerates outcomes itself from these fixed covariates, so shipping baked-in outcomes would be actively misleading (and would tie the package to one arbitrary replication out of 1000).
  • Generation script: prep.py (included in this PR), which validates its own output before writing — shape, treated/control counts, and per-column value ranges — since this file becomes permanent shared infrastructure once uploaded.
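
For reviewers who want a feel for the extraction step, here is a minimal sketch. It is not the prep.py in this PR. It assumes the CEVAE file is headerless with 30 columns in the order treatment, y_factual, y_cfactual, mu0, mu1, x1-x25, and that the output carries a header row; both are assumptions to check against the real files. The function name `extract_covariates` is hypothetical, and the sketch uses only the standard library.

```python
import csv
import io

# Assumed column order of the headerless CEVAE replication file.
OUTCOME_COLUMNS = ["y_factual", "y_cfactual", "mu0", "mu1"]
COVARIATE_COLUMNS = [f"x{i}" for i in range(1, 26)]
RAW_COLUMNS = ["treatment"] + OUTCOME_COLUMNS + COVARIATE_COLUMNS
KEPT_COLUMNS = ["treatment"] + COVARIATE_COLUMNS


def extract_covariates(raw_text: str) -> str:
    """Return CSV text holding only treatment + x1-x25, with a header row."""
    keep_idx = [RAW_COLUMNS.index(name) for name in KEPT_COLUMNS]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(KEPT_COLUMNS)
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:  # tolerate blank lines
            continue
        if len(row) != len(RAW_COLUMNS):
            raise ValueError(f"expected {len(RAW_COLUMNS)} fields, got {len(row)}")
        # Values are copied as strings, so nothing is re-rounded on the way out.
        writer.writerow([row[i] for i in keep_idx])
    return out.getvalue()


# Tiny synthetic row (not real IHDP data): treatment, 4 outcomes, then x1-x25.
raw_row = ["1", "5.6", "4.3", "3.2", "7.1"] + [str(i) for i in range(1, 26)]
header, first_row = extract_covariates(",".join(raw_row) + "\n").splitlines()
```

Copying field values as strings rather than parsing them to floats is one way to make byte-identical regeneration easier to achieve.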

Validated properties (asserted by prep.py)

  • Shape: 747 × 26. 139 treated / 608 control — the standard post-selection-bias IHDP sample size used throughout the literature.
  • x1-x6 (birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age) are continuous and already standardized (mean≈0, std=1) by the upstream NPCI/CEVAE pipeline. This extraction does not standardize them itself; it inherits that property.
  • x7-x25 are binary (0/1) site and demographic indicators, with one documented exception: x14 ("first" — firstborn indicator) is {1,2}-coded, not {0,1}. This isn't a data error — EconML's own port carries an explicit comment doing the equivalent adjustment for this same variable, so {1,2} is the literature-standard coding for it and has been left as-is.
  • Covariates are identical across all CEVAE replications (ihdp_npci_1.csv through ihdp_npci_1000.csv) by construction, so this file is replication-agnostic; extracting from replication 1 is representative of all of them.
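
The properties listed above could be checked along these lines. This is a rough sketch, not the assertions in prep.py: the function name `check_covariates`, the row representation (one dict per row), and the 0.05 tolerance are all assumptions made for illustration. The synthetic rows exist only to exercise the checks and are not IHDP data.

```python
import statistics

CONTINUOUS = [f"x{i}" for i in range(1, 7)]
BINARY = [f"x{i}" for i in range(7, 26) if i != 14]  # x14 is {1,2}-coded
EXPECTED_COLUMNS = ["treatment"] + [f"x{i}" for i in range(1, 26)]


def check_covariates(rows, tol=0.05):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(rows) != 747:
        problems.append(f"expected 747 rows, got {len(rows)}")
    if len(rows) < 2:
        return problems
    if any(set(r) != set(EXPECTED_COLUMNS) for r in rows):
        problems.append("unexpected column set")
        return problems
    treated = sum(1 for r in rows if r["treatment"] == 1)
    if (treated, len(rows) - treated) != (139, 608):
        problems.append(f"expected 139 treated / 608 control, got {treated}")
    for col in CONTINUOUS:
        values = [r[col] for r in rows]
        # The tolerance is loose enough to cover sample or population std.
        if abs(statistics.fmean(values)) > tol or abs(statistics.stdev(values) - 1) > tol:
            problems.append(f"{col} is not standardized")
    for col in BINARY:
        if not {r[col] for r in rows} <= {0, 1}:
            problems.append(f"{col} has values outside {{0, 1}}")
    if not {r["x14"] for r in rows} <= {1, 2}:
        problems.append("x14 has values outside {1, 2}")
    return problems


def make_synthetic_rows(n=747, n_treated=139):
    """Build placeholder rows that satisfy every check (for demonstration only)."""
    centered = [i - (n - 1) / 2 for i in range(n)]
    scale = statistics.stdev(centered)
    rows = []
    for i in range(n):
        row = {"treatment": 1 if i < n_treated else 0}
        row.update({col: centered[i] / scale for col in CONTINUOUS})
        row.update({col: i % 2 for col in BINARY})
        row["x14"] = 1 + i % 2
        rows.append(row)
    return rows


problems = check_covariates(make_synthetic_rows())
```

Returning a list of problems instead of raising on the first failure makes it easier to see every violated property in one run.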

Why ihdp_covariates.csv instead of the raw CEVAE CSV

IHDPDataset treats IHDP as a real simulator: covariates are fixed, outcomes are generated fresh per instantiation from a parameterized response surface. A file with baked-in y_factual/mu0/mu1 would suggest those are meant to be read directly rather than regenerated.
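
To make "fixed covariates, fresh outcomes" concrete, here is a hedged sketch of one well-known parameterization, response surface A from Hill (2011): Y(0) ~ N(Xβ, 1) and Y(1) ~ N(Xβ + 4, 1), with each coefficient in β drawn from (0, 1, 2, 3, 4) with probabilities (0.5, 0.2, 0.15, 0.1, 0.05). Whether IHDPDataset uses this surface, and how it is parameterized there, is not stated in this PR; the function name `simulate_outcomes` is hypothetical.

```python
import random


def simulate_outcomes(covariates, seed=None):
    """Draw potential outcomes (y0, y1) for fixed covariate rows.

    covariates: list of rows, each a list of floats (x1-x25 in the real file).
    Sketch of Hill (2011) response surface A; not pgmpy's implementation.
    """
    rng = random.Random(seed)
    n_cov = len(covariates[0])
    # Sparse coefficient vector: half of the entries are expected to be zero.
    beta = rng.choices([0, 1, 2, 3, 4], weights=[0.5, 0.2, 0.15, 0.1, 0.05], k=n_cov)
    y0, y1 = [], []
    for x in covariates:
        linear = sum(b * v for b, v in zip(beta, x))
        y0.append(rng.gauss(linear, 1.0))        # control outcome
        y1.append(rng.gauss(linear + 4.0, 1.0))  # treated outcome, constant effect of 4
    return y0, y1


# Placeholder covariates (not IHDP data): 5 rows x 25 columns.
fixed_x = [[(i + j) % 3 - 1.0 for j in range(25)] for i in range(5)]
run_a = simulate_outcomes(fixed_x, seed=1)
run_b = simulate_outcomes(fixed_x, seed=2)
```

The covariates are identical in both calls while the outcomes differ with the seed, which is why shipping a single set of baked-in outcomes alongside the covariates would be misleading.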

What x1-x25 actually are

Column names stay x1-x25 (matching every paper/package that reports IHDP numbers), but here's what each one is, for anyone browsing this dataset who wants to know. Verified by exact value reconstruction against EconML's independently-maintained raw covariate file — not assumed from documentation (see prep.py's COVARIATE_INFO for the full verification method).

| Column | Name | Meaning |
| --- | --- | --- |
| x1-x6 | bw, b.head, preterm, birth.o, nnhealth, momage | Birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age (continuous) |
| x7-x9 | sex, twin, b.marr | Infant sex, twin birth, mother married (binary) |
| x10-x12 | mom.lths, mom.hs, mom.scoll | Mother's education level: <high school / high school / some college (binary dummies) |
| x13, x15-x18 | cig, booze, drugs, work.dur, prenatal | Smoked / drank / used drugs during pregnancy, worked during pregnancy, received prenatal care (binary) |
| x14 | first | Firstborn, coded {1,2}, not {0,1} (genuine upstream convention, not an error) |
| x19-x25 | site1-site7 | Trial site indicator, 7 sites (binary) |
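
For anyone who wants readable labels while keeping x1-x25 as the canonical column names, the table above can be turned into a lookup. This is an illustrative sketch built only from that table; the names `COLUMN_NAMES` and `describe` are hypothetical and are not the COVARIATE_INFO structure in prep.py.

```python
# Original variable names in column order, taken from the table above.
ORIGINAL_NAMES = (
    ["bw", "b.head", "preterm", "birth.o", "nnhealth", "momage"]    # x1-x6
    + ["sex", "twin", "b.marr"]                                     # x7-x9
    + ["mom.lths", "mom.hs", "mom.scoll"]                           # x10-x12
    + ["cig", "first", "booze", "drugs", "work.dur", "prenatal"]    # x13-x18
    + [f"site{i}" for i in range(1, 8)]                             # x19-x25
)

COLUMN_NAMES = {f"x{i}": name for i, name in enumerate(ORIGINAL_NAMES, start=1)}


def describe(column: str) -> str:
    """Return 'x14 (first)' style labels; unknown columns pass through unchanged."""
    name = COLUMN_NAMES.get(column)
    return f"{column} ({name})" if name else column


labels = [describe(c) for c in ["treatment", "x1", "x14", "x25"]]
```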

Checklist

  • File uploaded to pgmpy/example_datasets on HuggingFace
  • Confirmed accessible via _get_raw_data("ihdp_covariates.csv")
  • prep.py included in this PR for reproducibility
  • Byte-identical regeneration confirmed (re-running prep.py
    against ihdp_npci_1.csv reproduces this file exactly — verified
    via checksum before upload)
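
The byte-identical check in the last item can be reproduced with a few lines of standard-library Python. This is a sketch of the general technique, not the exact procedure used for this PR: SHA-256 is an assumed choice of hash, and the helper names and temporary file names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path_a, path_b):
    """True when both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Demonstration with throwaway files standing in for the uploaded and regenerated CSVs.
with tempfile.TemporaryDirectory() as tmp:
    uploaded = os.path.join(tmp, "uploaded.csv")
    regenerated = os.path.join(tmp, "regenerated.csv")
    for path in (uploaded, regenerated):
        with open(path, "wb") as handle:
            handle.write(b"treatment,x1\n1,0.5\n")
    same = is_byte_identical(uploaded, regenerated)

    # A single changed byte (here a different line ending) changes the digest.
    with open(regenerated, "wb") as handle:
        handle.write(b"treatment,x1\r\n1,0.5\n")
    changed = is_byte_identical(uploaded, regenerated)
```

Opening files in binary mode matters here: text mode can translate line endings and hide exactly the kind of difference the check is meant to catch.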

- ihdp_covariates.csv: 747 rows x 26 cols (treatment + x1-x25)
  Extracted from CEVAE/NPCI replication, outcomes dropped.
  139 treated / 608 control, x1-x6 pre-standardized, x7-x25 binary (except x14, coded {1,2}).
- ihdp_npci_1.csv: Source file for provenance
- prep.py: Extraction script with validation assertions
@Gitanaskhan26

@ankurankan, can you please review this?
