
feat: add LD pruning and admixture data preparation #1059

Open
31puneet wants to merge 7 commits into malariagen:master from 31puneet:feat/ld-pruning-standalone

Conversation

Contributor

31puneet commented Mar 6, 2026

Overview

Fixes #1049

  • Adds ld_prune, a standalone LD pruning function using iterative Rogers-Huff r² via allel.locate_unlinked()
  • Adds prepare_admixture, which chains LD pruning → optional SNP downsampling (max_snps) → PLINK binary export (.bed/.bim/.fam)
  • Supports pre-filtering via both individual downsampling (cohort_size) and SNP downsampling (max_snps)
  • 10 new tests across both modules, all passing
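To illustrate the idea behind the windowed pruning step, here is a minimal pure-Python sketch. It is not the PR's implementation: it uses plain squared Pearson correlation on alt-allele counts as a stand-in for the Rogers-Huff estimator, and the names `r_squared`, `locate_unlinked_sketch` and `ld_prune_sketch` are hypothetical. The `size`, `step` and `threshold` parameters mirror those of `allel.locate_unlinked()`, but the exact windowing behaviour of the library may differ.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype-count vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic variant: correlation is undefined, treat as unlinked.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def locate_unlinked_sketch(gn, size=100, step=20, threshold=0.1):
    """Return a list of booleans, True where a variant is retained.

    gn is a list of variants, each a list of per-sample alt-allele counts.
    """
    n = len(gn)
    keep = [True] * n
    for start in range(0, n, step):
        stop = min(start + size, n)
        for i in range(start, stop):
            if not keep[i]:
                continue
            # Drop every later variant in the window that is linked to i.
            for j in range(i + 1, stop):
                if keep[j] and r_squared(gn[i], gn[j]) > threshold:
                    keep[j] = False
    return keep


def ld_prune_sketch(gn, size=100, step=20, threshold=0.1, n_iter=1):
    """Apply the windowed filter n_iter times; return kept variant indices."""
    idx = list(range(len(gn)))
    for _ in range(n_iter):
        keep = locate_unlinked_sketch(
            [gn[i] for i in idx], size=size, step=step, threshold=threshold
        )
        idx = [i for i, k in zip(idx, keep) if k]
    return idx
```

Repeating the filter (`n_iter > 1`) can remove further variants because, after one pass, variants that were previously in different windows may end up close enough to be compared.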

Impact

  • Users can now perform LD pruning independently for PCA, GWAS, or
    other analyses that assume linkage equilibrium
  • Users can prepare ADMIXTURE-ready PLINK files directly from the API
    without manual preprocessing
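For readers unfamiliar with the PLINK binary format, the sketch below shows how genotypes are packed into a variant-major .bed file: a three-byte header, then two bits per sample with four samples per byte, lowest-order bits first. This is only an illustration of the file layout; the PR itself is described as following to_plink.py conventions and verifying output with bed_reader. The function name `encode_bed` is hypothetical, and whether A1 corresponds to the reference or alternate allele in this PR is an assumption left open here.

```python
# Header for a variant-major PLINK 1 .bed file.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit genotype codes, keyed by the count of A1 alleles
# (A1 is the first allele listed in the .bim file). None means missing.
_CODE = {2: 0b00, 1: 0b10, 0: 0b11, None: 0b01}


def encode_bed(a1_counts):
    """Encode genotypes as .bed bytes.

    a1_counts is a list of variants, each a list of per-sample
    A1 allele counts (0, 1, 2, or None for a missing call).
    """
    out = bytearray(BED_MAGIC)
    for variant in a1_counts:
        # Each variant occupies ceil(n_samples / 4) bytes.
        for start in range(0, len(variant), 4):
            byte = 0
            for k, g in enumerate(variant[start:start + 4]):
                byte |= _CODE[g] << (2 * k)
            out.append(byte)  # unused high bits in the last byte stay zero
    return bytes(out)
```

The accompanying .bim and .fam files are plain text and list the variants and samples in the same order as the packed genotypes.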

Contributor Author

31puneet commented Mar 6, 2026

Hi @jonbrenas, this PR is ready for review. I’d appreciate your feedback. Thanks!

Contributor Author

31puneet commented Mar 8, 2026

Hi @jonbrenas, I've updated the PR.
ld_prune returns an xr.Dataset with caching, and prepare_admixture wraps it with random downsampling and PLINK export, following to_plink.py conventions. Tested on real Ag3 data (3L, AG1000G-BF-A): ~10.9M variants → 658K after pruning → 50K with max_snps. Valid PLINK output was verified with bed_reader.
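The random downsampling step (658K → 50K in the test above) can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `downsample_snps` and the `seed` parameter are hypothetical, and it assumes that the selected variants should stay in their original genomic order and that the selection should be reproducible.

```python
import random


def downsample_snps(n_variants, max_snps, seed=42):
    """Pick at most max_snps variant indices uniformly at random.

    Returns sorted indices so the selected variants keep their
    original order. If max_snps is None or not smaller than
    n_variants, all indices are returned unchanged.
    """
    if max_snps is None or n_variants <= max_snps:
        return list(range(n_variants))
    rng = random.Random(seed)  # local RNG: does not touch global state
    return sorted(rng.sample(range(n_variants), max_snps))
```

Sorting the sampled indices matters for PLINK export, since the .bim file is expected to list variants in positional order.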




Development

Successfully merging this pull request may close these issues.

Adding Admixture functionalities to the API

2 participants