Add VCF export support for SNP call datasets #1071

Open
adilraza99 wants to merge 5 commits into malariagen:master from adilraza99:GH-1054-add-vcf-export

@adilraza99
Contributor

Summary

Adds support for exporting SNP call datasets to Variant Call Format (VCF), enabling interoperability with common genomics tools and workflows.

Changes

  • Introduce a VcfExporter mixin following the existing PlinkConverter pattern
  • Implement snp_calls_to_vcf() to export SNP calls directly to VCF
  • Stream data in chunks to avoid loading the full dataset into memory
  • Integrate the exporter into AnophelesDataResource
  • Add tests following the test_plink_converter pattern to verify output structure and dataset consistency
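
The kind of output-structure check mentioned in the last bullet could look roughly like the sketch below. This is an illustration only, not the PR's actual test code: `check_vcf_structure` is a hypothetical helper, and it works on VCF text held in a string rather than on the real exporter output.

```python
from typing import Sequence

# The nine fixed columns that precede the per-sample columns in a VCF with genotypes.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def check_vcf_structure(text: str, expected_samples: Sequence[str], expected_n_variants: int) -> bool:
    """Hypothetical helper: assert basic structural properties of VCF text."""
    lines = text.splitlines()
    meta = [ln for ln in lines if ln.startswith("##")]
    header = [ln for ln in lines if ln.startswith("#") and not ln.startswith("##")]
    records = [ln for ln in lines if ln and not ln.startswith("#")]

    # The first meta line must declare the file format.
    assert meta and meta[0].startswith("##fileformat=VCF")
    # Exactly one column header line, with the fixed columns followed by the samples.
    assert len(header) == 1
    columns = header[0].split("\t")
    assert columns[:9] == FIXED_COLUMNS
    assert columns[9:] == list(expected_samples)
    # One record per variant, each with as many fields as the header.
    assert len(records) == expected_n_variants
    for record in records:
        assert len(record.split("\t")) == len(columns)
    return True
```

A consistency test could then compare `expected_samples` and `expected_n_variants` against the sample IDs and variant count of the source dataset.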

Notes

The implementation uses snp_calls() as the data source so that multiallelic sites are preserved. VCF records are written incrementally from chunked genotype data to keep memory usage low when working with large datasets.
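
As a rough illustration of the chunked writing described above, here is a minimal sketch. It assumes plain in-memory lists stand in for the chunked arrays that snp_calls() would provide, and the names `write_vcf`, `iter_chunks` and `format_gt` are hypothetical, not taken from this PR's diff.

```python
import io
from typing import Iterator, Sequence, TextIO


def iter_chunks(n_rows: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive row ranges that together cover 0..n_rows."""
    for start in range(0, n_rows, chunk_size):
        yield range(start, min(start + chunk_size, n_rows))


def format_gt(call: Sequence[int]) -> str:
    """Format one call as a GT string; negative allele indices mean missing."""
    return "/".join("." if allele < 0 else str(allele) for allele in call)


def write_vcf(
    out: TextIO,
    contig: str,
    positions: Sequence[int],
    alleles: Sequence[Sequence[str]],
    genotypes: Sequence[Sequence[Sequence[int]]],
    sample_ids: Sequence[str],
    chunk_size: int = 2,
) -> int:
    """Write a minimal VCF one chunk of variants at a time; return the record count."""
    out.write("##fileformat=VCFv4.3\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", *sample_ids]
    out.write("\t".join(header) + "\n")

    n_written = 0
    for rows in iter_chunks(len(positions), chunk_size):
        # Only the lines for the current chunk are held in memory.
        lines = []
        for i in rows:
            ref = alleles[i][0]
            alts = [a for a in alleles[i][1:] if a]  # drop empty ALT slots
            alt = ",".join(alts) if alts else "."
            fields = [contig, str(positions[i]), ".", ref, alt, ".", ".", ".", "GT"]
            fields.extend(format_gt(call) for call in genotypes[i])
            lines.append("\t".join(fields))
        out.write("\n".join(lines) + "\n")
        n_written += len(lines)
    return n_written
```

With real data, each chunk would be a slice of the underlying arrays rather than a range over Python lists, so peak memory depends on the chunk size rather than on the number of variants.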


closes #1054

adilraza99 force-pushed the GH-1054-add-vcf-export branch 6 times, most recently from c951e39 to 711f103 on March 7, 2026 15:59
Comment on lines +119 to +122
alleles = allele_chunk[j]
ref = str(alleles[0])
alt_alleles = [str(a) for a in alleles[1:] if str(a) != ""]
alt = ",".join(alt_alleles) if alt_alleles else "."

How does this handle allele values that may be byte-backed? Calling str() on a bytes value will lead to a malformed VCF that cannot be used downstream.

Contributor Author

Thanks for pointing this out. This was something I had already been looking into while working on the exporter. I've updated the implementation to explicitly decode byte-backed allele values instead of relying on str(), and verified locally that REF/ALT are written correctly without any byte-string artifacts (e.g. b'A').
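
For illustration, the decoding step could be sketched as below. This is a sketch under stated assumptions rather than the code in the updated commit: `decode_allele` and `ref_alt` are hypothetical names, and alleles are assumed to be single ASCII bases stored as bytes.

```python
from typing import Sequence, Tuple, Union

Allele = Union[bytes, bytearray, str]


def decode_allele(value: Allele) -> str:
    """Return an allele as plain text, decoding bytes instead of calling str() on them."""
    if isinstance(value, (bytes, bytearray)):
        # str(b"A") would give "b'A'", so decode explicitly.
        return bytes(value).decode("ascii")
    return str(value)


def ref_alt(alleles: Sequence[Allele]) -> Tuple[str, str]:
    """Build the REF and ALT fields from one row of alleles (REF first)."""
    decoded = [decode_allele(a) for a in alleles]
    ref = decoded[0]
    alts = [a for a in decoded[1:] if a]  # skip empty ALT slots
    return ref, (",".join(alts) if alts else ".")
```

Because NumPy's bytes scalar type is a subclass of Python's `bytes`, the same `isinstance` check should also cover values read from an array with dtype |S1.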

adilraza99 force-pushed the GH-1054-add-vcf-export branch from 711f103 to 66b5f0f on March 7, 2026 18:42
Decode byte-backed allele values returned by snp_calls() (dtype |S1)
before writing REF and ALT fields to the VCF. This prevents values
like b'A' appearing in the output and ensures valid VCF formatting.

All VCF exporter tests pass after this fix.
adilraza99 force-pushed the GH-1054-add-vcf-export branch from 66b5f0f to d2e4994 on March 7, 2026 19:02
adilraza99 marked this pull request as draft on March 7, 2026 22:07
adilraza99 marked this pull request as ready for review on March 9, 2026 02:51
Development

Successfully merging this pull request may close these issues.

Adding other file formats option

2 participants