Add VCF export support for SNP call datasets #1071
Open
adilraza99 wants to merge 5 commits into malariagen:master
Force-pushed from c951e39 to 711f103
saadte reviewed on Mar 7, 2026
Comment on lines +119 to +122
    alleles = allele_chunk[j]
    ref = str(alleles[0])
    alt_alleles = [str(a) for a in alleles[1:] if str(a) != ""]
    alt = ",".join(alt_alleles) if alt_alleles else "."
How does this handle the potentially byte-backed nature of the values? Using str() in that case will produce a malformed VCF that cannot be used downstream.
adilraza99 (Contributor, Author)
Thanks for pointing this out. This was something I had already been looking into while working on the exporter. I've updated the implementation to explicitly decode byte-backed allele values instead of relying on str(), and verified locally that REF/ALT are written correctly without any byte-string artifacts (e.g. b'A').
Force-pushed from 711f103 to 66b5f0f
Decode byte-backed allele values returned by snp_calls() (dtype |S1) before writing REF and ALT fields to the VCF. This prevents values like b'A' appearing in the output and ensures valid VCF formatting. All VCF exporter tests pass after this fix.
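The difference between str() and an explicit decode can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `decode_allele` and `ref_alt_fields` are hypothetical, plain Python `bytes` objects stand in for the |S1 values, and ASCII is assumed as the encoding.

```python
def decode_allele(value):
    """Return an allele as text, decoding bytes-like values (hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: allele codes are ASCII nucleotide letters.
        return value.decode("ascii")
    return str(value)


def ref_alt_fields(alleles):
    """Build the VCF REF and ALT strings from one site's alleles (hypothetical helper)."""
    decoded = [decode_allele(a) for a in alleles]
    ref = decoded[0]
    # Empty strings mark unused ALT slots and are dropped.
    alt_alleles = [a for a in decoded[1:] if a != ""]
    alt = ",".join(alt_alleles) if alt_alleles else "."
    return ref, alt


# Plain bytes stand in for the byte-backed allele values.
site = [b"A", b"C", b"", b"T"]

# str() keeps the b'...' wrapper, so an empty allele becomes "b''" and
# is no longer caught by a comparison against "".
naive = [str(a) for a in site]

ref, alt = ref_alt_fields(site)
```

Note that the naive path fails in two ways: the wrapper leaks into REF/ALT, and the empty-allele filter stops working because `str(b"")` is not the empty string.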
Force-pushed from 66b5f0f to d2e4994
Summary
Adds support for exporting SNP call datasets to Variant Call Format (VCF), enabling interoperability with common genomics tools and workflows.
Changes
- VcfExporter mixin following the existing PlinkConverter pattern
- snp_calls_to_vcf() to export SNP calls directly to VCF
- Integration with AnophelesDataResource
- Tests following the test_plink_converter pattern to verify output structure and dataset consistency

Notes
The implementation uses snp_calls() as the data source so that multiallelic sites are preserved. VCF records are written incrementally from chunked genotype data to keep memory usage low when working with large datasets.

Closes #1054
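The chunked writing approach described in the notes might look roughly like the sketch below. It is not the PR's implementation: `iter_chunks`, `format_gt`, and `write_vcf` are hypothetical names, the inputs are plain Python lists with already-decoded alleles, and the contig name and sample IDs are example values. With lists the slicing is purely illustrative; the memory benefit only appears when the inputs are lazily loaded arrays, so that each slice is materialized one chunk at a time.

```python
import io


def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) row ranges that together cover n_rows."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def format_gt(call):
    """Format one call as a VCF GT string; negative allele indices mean missing."""
    return "/".join("." if a < 0 else str(a) for a in call)


def write_vcf(out, contig, positions, alleles, genotypes, samples, chunk_size=2):
    """Write a minimal VCF to the text stream `out`, one chunk of sites at a time."""
    out.write("##fileformat=VCFv4.3\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    out.write("\t".join(header + list(samples)) + "\n")
    for start, stop in iter_chunks(len(positions), chunk_size):
        # Only this slice of each input is touched per iteration.
        pos_chunk = positions[start:stop]
        allele_chunk = alleles[start:stop]
        gt_chunk = genotypes[start:stop]
        for j in range(stop - start):
            ref = allele_chunk[j][0]
            # Empty strings mark unused ALT slots.
            alts = [a for a in allele_chunk[j][1:] if a]
            alt = ",".join(alts) if alts else "."
            fields = [contig, str(pos_chunk[j]), ".", ref, alt, ".", ".", ".", "GT"]
            fields += [format_gt(call) for call in gt_chunk[j]]
            out.write("\t".join(fields) + "\n")


# Example data: three sites, two samples (all values are made up).
positions = [100, 205, 310]
alleles = [["A", "C", "", ""], ["G", "T", "A", ""], ["C", "", "", ""]]
genotypes = [[[0, 1], [1, 1]], [[0, 2], [-1, -1]], [[0, 0], [0, 0]]]

buf = io.StringIO()
write_vcf(buf, "2L", positions, alleles, genotypes, ["S1", "S2"])
lines = buf.getvalue().splitlines()
```

Keeping all ALT alleles on one record, rather than splitting a site into several biallelic records, is what lets genotype indices such as 2 remain meaningful at multiallelic sites.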