feat: Add FASTQ-based species group detection utility (#881) #1089

Open
Gopisokk wants to merge 1 commit into malariagen:master from Gopisokk:GH881-fastq-taxon-id

Gopisokk commented Mar 9, 2026

What problem does this solve?

Currently, researchers must have fully genotyped samples (VCF/variant calls) before they can identify which MalariaGEN resource (Ag3, Af1, Amin1) their data belongs to. This is a significant barrier for researchers who only have raw FASTQ sequencing reads, especially those in endemic regions without access to heavy compute resources.

This PR adds a lightweight utility (malariagen_data.identify_taxon) that processes raw FASTQ files via streaming. It outputs a confidence-scored species group assignment and a resource routing recommendation without requiring a full genotyping pipeline, so researchers can quickly determine which resource to use.

How does it solve it?

  1. Low-Memory FASTQ Parsing: Uses the screed library to stream and subsample raw fastq and fastq.gz sequences, capping the sample at 10,000 random reads to keep memory overhead low.
  2. Scikit-Learn Classifier Integration: Introduces a naive probability model (currently trained on mocked k-mer reference data as a base for the GSoC application) that predicts among gambiae_complex, funestus_subgroup, and stephensi.
  3. Structured Response Payload: Returns the structured JSON-like payload recommended in #881 ([Feature] Add FASTQ-based species group detection utility to route samples to correct genomic resource), containing .predicted_group, .recommended_resource, .confidence thresholds, and nested probability score candidates.
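
Steps 1 and 3 above can be illustrated with a minimal sketch. It is not the PR's implementation: it reads FASTQ with the standard library instead of screed so that it runs without extra dependencies, and it uses reservoir sampling as one possible way to draw a capped random subsample from a stream. The names `iter_fastq_sequences`, `reservoir_sample`, `TaxonResult`, and `build_result` are hypothetical, and the group-to-resource mapping is an assumption passed in by the caller.

```python
import gzip
import random
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional


def iter_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of each record, one record at a time.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record holds the bases
                yield line.strip()


def reservoir_sample(items: Iterable[str], k: int = 10_000, seed: int = 0) -> List[str]:
    """Return a uniform random sample of at most k items using O(k) memory."""
    rng = random.Random(seed)
    sample: List[str] = []
    for index, item in enumerate(items):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (index + 1)
    return sample


@dataclass
class TaxonResult:
    """Hypothetical payload mirroring the fields named in the description."""
    predicted_group: str
    recommended_resource: Optional[str]
    confidence: float
    candidates: Dict[str, float] = field(default_factory=dict)


def build_result(probabilities: Dict[str, float], resources: Dict[str, str]) -> TaxonResult:
    """Pick the most probable group and look up its resource, if one is mapped."""
    best = max(probabilities, key=probabilities.get)
    return TaxonResult(best, resources.get(best), probabilities[best], dict(probabilities))
```

A caller might combine these as `reservoir_sample(iter_fastq_sequences(path))` and then pass classifier probabilities to `build_result` together with a mapping such as `{"gambiae_complex": "Ag3", "funestus_subgroup": "Af1"}`. Groups without a mapped resource yield `recommended_resource=None` in this sketch.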

Relevant Dependencies

  • Added the screed sequence-parsing library for fast FASTQ streaming.
  • Added scikit-learn to compute naive classification probabilities over k-mer (N-gram) count vectors.
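
To make the idea of classifying on k-mer count vectors concrete, here is a dependency-free sketch. It is a stand-in, not the PR's scikit-learn code: it swaps in a small hand-written multinomial Naive Bayes with add-one smoothing and a uniform prior, and the function names (`kmer_counts`, `train`, `predict_proba`) and the choice of `k=4` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def kmer_counts(reads: Iterable[str], k: int = 4) -> Counter:
    """Count overlapping k-mers across reads, skipping any with non-ACGT bases."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def train(labelled_reads: Dict[str, List[str]], k: int = 4) -> Dict[str, Counter]:
    """Build one k-mer count profile per label (the 'model' in this sketch)."""
    return {label: kmer_counts(reads, k) for label, reads in labelled_reads.items()}


def predict_proba(model: Dict[str, Counter], reads: Iterable[str], k: int = 4) -> Dict[str, float]:
    """Return per-label probabilities for the query reads (uniform prior)."""
    query = kmer_counts(reads, k)
    vocab_size = 4 ** k
    log_scores: Dict[str, float] = {}
    for label, counts in model.items():
        total = sum(counts.values())
        # Multinomial log-likelihood with add-one (Laplace) smoothing.
        log_scores[label] = sum(
            n * math.log((counts[kmer] + 1) / (total + vocab_size))
            for kmer, n in query.items()
        )
    # Normalise in log space to avoid underflow on long inputs.
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    norm = sum(weights.values())
    return {label: weight / norm for label, weight in weights.items()}
```

The returned dictionary plays the role of the nested candidate scores, and its maximum value can serve as the reported confidence. A real reference panel would replace the toy training reads used to exercise this sketch.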

Note: This PR is a prerequisite demonstrator intended for GSoC 2026 application review for the machine-learning taxon classifier project. It models the intended algorithm layout but uses a mocked local training set of nucleotide combinations to represent the pipeline, without downloading the full Ag3/Af1 references.

Relevant issue numbers

Closes #881

Testing done

  • Wrote and passed synthetic .fastq.gz streaming tests verifying random read processing.
  • Verified that the model predicts the expected target labels with 90%+ confidence and routes them to the respective resources (e.g. Ag3).
  • Added test coverage ensuring graceful error handling for non-existent file paths.
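
The shape of such tests can be sketched as follows. This is an illustrative outline only, not the PR's test suite: `count_fastq_records` is a hypothetical stand-in for the utility's parser, and raising `FileNotFoundError` is an assumed choice for how a missing path is reported.

```python
import gzip
import os
import tempfile


def count_fastq_records(path: str) -> int:
    """Hypothetical stand-in for the parser: count 4-line FASTQ records."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"FASTQ file not found: {path}")
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def write_synthetic_fastq_gz(path: str, n_reads: int) -> None:
    """Write n_reads identical toy records to a gzip-compressed FASTQ file."""
    with gzip.open(path, "wt") as handle:
        for i in range(n_reads):
            handle.write(f"@read{i}\nACGTACGTAC\n+\nIIIIIIIIII\n")


def test_streaming_and_missing_path() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # A synthetic .fastq.gz file should stream back the records written.
        path = os.path.join(tmp, "synthetic.fastq.gz")
        write_synthetic_fastq_gz(path, n_reads=25)
        assert count_fastq_records(path) == 25

        # A non-existent path should raise a clear error rather than crash obscurely.
        missing = os.path.join(tmp, "missing.fastq.gz")
        try:
            count_fastq_records(missing)
        except FileNotFoundError:
            pass
        else:
            raise AssertionError("expected FileNotFoundError")
```

The test function follows pytest naming conventions but uses only plain assertions, so it can also be called directly.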
