feat: Add FASTQ-based species group detection utility (#881) by Gopisokk · Pull Request #1089 · malariagen/malariagen-data-python

Gopisokk · 2026-03-09T16:30:56Z

What problem does this solve?

Currently, researchers must have fully genotyped samples (VCF/variant calls) before they can identify which MalariaGEN resource (Ag3, Af1, Amin1) their data belongs to. This is a significant barrier for researchers who only have raw FASTQ sequencing reads, especially those in endemic regions without access to heavy compute resources.

This PR adds a lightweight, computationally cheap utility (malariagen_data.identify_taxon) that streams raw FASTQ files. It returns a confidence-scored species group assignment and a resource routing recommendation without requiring a full genotyping pipeline, so researchers can quickly determine which resource to use.
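As a rough illustration of what such a result could look like, here is a minimal sketch of a result object. The field names follow the ones listed in this description; the `TaxonResult` class, the `build_result` helper, and the `RESOURCE_BY_GROUP` routing table are hypothetical names introduced for illustration, and the PR description does not say which resource stephensi routes to, so that entry is left as `None`.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional

# Hypothetical routing table. The group and resource names appear in the PR
# description; the stephensi pairing is not stated there, so it stays None.
RESOURCE_BY_GROUP: Dict[str, Optional[str]] = {
    "gambiae_complex": "Ag3",
    "funestus_subgroup": "Af1",
    "stephensi": None,
}


@dataclass
class TaxonResult:
    """Illustrative payload shape; not the PR's actual class."""

    predicted_group: str
    recommended_resource: Optional[str]
    confidence: float
    candidates: Dict[str, float] = field(default_factory=dict)


def build_result(probabilities: Dict[str, float]) -> TaxonResult:
    """Pick the most probable group and attach its routing recommendation."""
    best = max(probabilities, key=probabilities.get)
    return TaxonResult(
        predicted_group=best,
        recommended_resource=RESOURCE_BY_GROUP.get(best),
        confidence=probabilities[best],
        candidates=dict(probabilities),
    )


# Made-up probabilities, used only to show the payload shape.
result = build_result(
    {"gambiae_complex": 0.93, "funestus_subgroup": 0.05, "stephensi": 0.02}
)
print(asdict(result))
```

Keeping the full candidate scores alongside the top prediction lets a caller apply their own confidence cutoff instead of trusting a single label.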

How does it solve it?

Low-Memory FASTQ Parsing: Uses the screed library to stream and subsample raw .fastq and .fastq.gz files, capping the sample at 10,000 random reads to keep memory overhead low.
Scikit-Learn Classifier Integration: Introduces a naive probabilistic classifier (currently trained on mocked k-mer reference data for the GSoC application base) that predicts one of three groups: gambiae_complex, funestus_subgroup, or stephensi.
Structured Response Payload: Returns the structured, JSON-like payload recommended in issue #881 ([Feature] Add FASTQ-based species group detection utility to route samples to correct genomic resource), containing .predicted_group, .recommended_resource, .confidence thresholds, and nested candidate probability scores.
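The streaming subsample described above can be sketched with reservoir sampling, which holds at most a fixed number of reads in memory however large the file is. This is an assumption about the approach, not the PR's code: the PR uses screed for parsing, while this sketch uses only the standard library and assumes strict 4-line FASTQ records. The function names `iter_fastq_sequences` and `subsample_reads` are hypothetical.

```python
import gzip
import os
import random
import tempfile
from typing import Iterator, List, Optional


def iter_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()


def subsample_reads(
    path: str, max_reads: int = 10_000, seed: Optional[int] = None
) -> List[str]:
    """Uniformly sample up to max_reads reads in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for seen, sequence in enumerate(iter_fastq_sequences(path)):
        if seen < max_reads:
            reservoir.append(sequence)
        else:
            # Replace an existing entry with probability max_reads / (seen + 1).
            slot = rng.randint(0, seen)
            if slot < max_reads:
                reservoir[slot] = sequence
    return reservoir


# Demo on a synthetic file: 50 reads of distinct lengths, sample 10 of them.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "synthetic.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        for i in range(50):
            out.write(f"@read{i}\n{'A' * (i + 1)}\n+\n{'I' * (i + 1)}\n")
    sample = subsample_reads(demo_path, max_reads=10, seed=1)
    everything = subsample_reads(demo_path, max_reads=100, seed=1)

print(len(sample), len(everything))
```

Because the reservoir never grows beyond `max_reads`, memory use depends on the cap rather than on the size of the input file.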

Relevant Dependencies

Added screed, a sequence-parsing library, for quick FASTQ streaming.
Added scikit-learn to compute naive classification probabilities from N-gram (k-mer) count vectors.
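To make the idea of naive classification over k-mer counts concrete, here is a small multinomial naive Bayes sketch written with the standard library only. It is a stand-in, not the PR's model: the PR uses scikit-learn and the description does not name the exact estimator. The k-mer length, the function names, and the toy training strings are all assumptions, and the strings have no biological meaning.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

K = 4  # illustrative k-mer length, not taken from the PR

Model = Dict[str, Tuple[Counter, int]]


def kmer_counts(reads: Iterable[str], k: int = K) -> Counter:
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def train(training_reads: Dict[str, List[str]], k: int = K) -> Model:
    """Store per-group k-mer counts and a Laplace-smoothed denominator."""
    model: Model = {}
    for group, reads in training_reads.items():
        counts = kmer_counts(reads, k)
        model[group] = (counts, sum(counts.values()) + 4 ** k)
    return model


def predict_proba(model: Model, reads: Iterable[str], k: int = K) -> Dict[str, float]:
    """Return normalized group probabilities, assuming a uniform prior."""
    query = kmer_counts(reads, k)
    log_scores = {
        group: sum(n * math.log((counts[kmer] + 1) / total) for kmer, n in query.items())
        for group, (counts, total) in model.items()
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    weights = {group: math.exp(score - peak) for group, score in log_scores.items()}
    norm = sum(weights.values())
    return {group: weight / norm for group, weight in weights.items()}


# Mocked training strings (made up for illustration only).
toy_training = {
    "gambiae_complex": ["ACGTACGTACGTACGT"],
    "funestus_subgroup": ["GGGCCCGGGCCCGGGC"],
    "stephensi": ["TTTAAATTTAAATTTA"],
}
toy_model = train(toy_training)
probabilities = predict_proba(toy_model, ["ACGTACGTAC"])
best_group = max(probabilities, key=probabilities.get)
print(best_group, probabilities)
```

With mocked data like this, high confidence mostly reflects how different the toy strings are, so scores from a demonstrator should not be read as real-world accuracy.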

Note: This PR is a prerequisite demonstrator intended for GSoC 2026 application review for the machine-learning taxon classifier project. It models the intended algorithm layout but uses a mocked local training set of nucleotide combinations to represent the pipeline, without downloading the full Ag3/Af1 references.

Relevant issue numbers

Closes #881

Testing done

Wrote and passed streaming tests on synthetic .fastq.gz files, verifying random read processing.
Verified that the models predict the target groups with 90%+ confidence and route them to the respective reference resources (e.g. Ag3).
Added test coverage ensuring graceful error handling for non-existent file paths.
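Tests of this kind might look roughly like the sketch below. It is not the PR's test suite: `count_reads` is a hypothetical helper standing in for the utility under test, and the checks use plain assertions so the example runs without a test framework.

```python
import gzip
import os
import tempfile


def count_reads(path: str) -> int:
    """Hypothetical helper: count 4-line FASTQ records, failing clearly on a bad path."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"FASTQ file not found: {path}")
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def test_counts_synthetic_gzip() -> None:
    """Write a small synthetic .fastq.gz file and read it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "synthetic.fastq.gz")
        with gzip.open(path, "wt") as handle:
            for i in range(25):
                handle.write(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n")
        assert count_reads(path) == 25


def test_missing_path_raises() -> None:
    """A non-existent path should raise a descriptive error, not crash obscurely."""
    missing = os.path.join(tempfile.gettempdir(), "no_such_dir", "missing.fastq.gz")
    try:
        count_reads(missing)
    except FileNotFoundError as err:
        assert "not found" in str(err)
    else:
        raise AssertionError("expected FileNotFoundError")


test_counts_synthetic_gzip()
test_missing_path_raises()
```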

feat: Add FASTQ-based species group detection utility (malariagen#881)

633c812

Gopisokk wants to merge 1 commit into malariagen:master from Gopisokk:GH881-fastq-taxon-id
