feat: Add FASTQ-based species group detection utility (#881) #1089
Open
Gopisokk wants to merge 1 commit into malariagen:master from
What problem does this solve?
Currently, researchers must have fully genotyped samples (VCF/variant calls) before they can identify which MalariaGEN resource (Ag3, Af1, Amin1) their data belongs to. This is a significant barrier for researchers who only have raw FASTQ sequencing reads, especially those in endemic regions without access to heavy compute resources.
This PR establishes a lightweight, computationally cheap utility (`malariagen_data.identify_taxon`) that processes raw FASTQ files via streaming. It outputs a confidence-scored species group assignment and a resource routing recommendation without requiring a full genotyping pipeline, so researchers can quickly determine which resource to use.
How does it solve it?
- Uses the `screed` library to stream and subsample raw `fastq` and `fastq.gz` sequences, capping at 10,000 random reads to keep memory overhead low.
- Classifies reads into three species groups: `gambiae_complex`, `funestus_subgroup`, and `stephensi`.
- Returns a result exposing `.predicted_group`, `.recommended_resource`, `.confidence` thresholds, and nested probability score candidates.
Relevant Dependencies
- `screed` sequence parsing library for quick FASTQ streaming.
- `scikit-learn` to compute naive classification probabilities from N-gram vectors.
Note: This PR is a prerequisite demonstrator intended for GSoC 2026 application review for the machine-learning taxon classifier project. It models the algorithm layout but uses a mocked local training set of nucleotide combinations to represent the pipeline without downloading full Ag3/Af1 references.
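For reviewers, here is a rough sketch of how the classification and result steps could fit together. It is not the code in this PR. To stay dependency-free it swaps `scikit-learn` for a hand-written multinomial naive Bayes over k-mer (character N-gram) counts, which is itself an assumption about what "naive classification" means here. The names `TaxonResult`, `train`, `classify`, `RESOURCE_FOR_GROUP`, the k-mer length, and the `min_confidence` cutoff are all hypothetical. The routing table only covers Ag3 and Af1; no resource is assumed for `stephensi`.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional

K = 4  # k-mer length; an illustrative choice, not taken from the PR

# Hypothetical routing table. stephensi is left unmapped on purpose.
RESOURCE_FOR_GROUP = {"gambiae_complex": "Ag3", "funestus_subgroup": "Af1"}


@dataclass
class TaxonResult:
    predicted_group: str
    recommended_resource: Optional[str]
    confidence: float
    candidates: Dict[str, float] = field(default_factory=dict)


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labelled_reads, k=K):
    """Build per-group k-mer counts from (group, sequence) pairs."""
    counts = {}
    for group, seq in labelled_reads:
        counts.setdefault(group, Counter()).update(kmers(seq, k))
    return counts


def classify(reads, counts, k=K, min_confidence=0.8):
    """Score reads against each group (uniform prior, add-one smoothing)."""
    vocab = set().union(*counts.values())
    log_scores = {}
    for group, group_counts in counts.items():
        total = sum(group_counts.values()) + len(vocab)
        log_scores[group] = sum(
            math.log((group_counts[km] + 1) / total)
            for read in reads
            for km in kmers(read, k)
        )
    # Normalise log scores into probabilities (numerically stable softmax).
    top = max(log_scores.values())
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    norm = sum(weights.values())
    probs = {g: w / norm for g, w in weights.items()}
    best = max(probs, key=probs.get)
    # Only recommend a resource when the call clears the cutoff.
    resource = RESOURCE_FOR_GROUP.get(best) if probs[best] >= min_confidence else None
    return TaxonResult(best, resource, probs[best], probs)


# Mocked training set of nucleotide strings, in the spirit of the PR's demo data.
model = train([
    ("gambiae_complex", "ACGTACGTACGTACGT"),
    ("funestus_subgroup", "TTGGCCAATTGGCCAA"),
    ("stephensi", "GAGAGAGACTCTCTCT"),
])
result = classify(["ACGTACGTACGT"], model)
```

Returning `None` for `recommended_resource` on a low-confidence or unmapped call is one possible design: it forces the caller to handle the uncertain case explicitly rather than silently routing to the wrong resource.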
Relevant issue numbers
Closes #881
Testing done
`.fastq.gz` streaming tests verifying random read processing.
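A streaming test of this kind might look like the sketch below. It is an illustration, not the test in this PR. It replaces `screed` with a minimal standard-library reader that assumes unwrapped four-line FASTQ records, and it uses reservoir sampling (Algorithm R) as one way to cap the number of reads held in memory. The names `iter_fastq_sequences`, `sample_reads`, and `test_fastq_gz_streaming` are hypothetical.

```python
import gzip
import os
import random
import tempfile


def iter_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record, one at a time."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # line 2 of every record is the sequence
                yield line.strip()


def sample_reads(path, max_reads=10_000, seed=None):
    """Reservoir-sample up to max_reads sequences without loading the file."""
    rng = random.Random(seed)
    reservoir = []
    for i, seq in enumerate(iter_fastq_sequences(path)):
        if i < max_reads:
            reservoir.append(seq)
        else:
            # Keep the new read with probability max_reads / (i + 1).
            j = rng.randint(0, i)
            if j < max_reads:
                reservoir[j] = seq
    return reservoir


def test_fastq_gz_streaming():
    # 50 reads with distinct lengths so every sequence is unique.
    expected = ["A" * (i + 1) + "C" for i in range(50)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reads.fastq.gz")
        with gzip.open(path, "wt") as handle:
            for i, seq in enumerate(expected):
                handle.write(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n")

        sampled = sample_reads(path, max_reads=10, seed=0)
        assert len(sampled) == 10
        assert len(set(sampled)) == 10        # no read is picked twice
        assert set(sampled) <= set(expected)  # only real reads are returned

        # Same seed gives the same sample; a cap above the file size returns everything.
        assert sample_reads(path, max_reads=10, seed=0) == sampled
        assert sample_reads(path, max_reads=100) == expected
```

Fixing the seed keeps the test deterministic while still exercising the random replacement path, since the file holds more reads than the cap.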