
feat: add mutation pattern analysis#1767

Open
ivan-aksamentov wants to merge 9 commits into master from feat/mutation-pattern-analysis


@ivan-aksamentov
Member

@ivan-aksamentov ivan-aksamentov commented May 26, 2026


Test web: https://nextstrain--nextclade--pr-1767.previews.neherlab.click/?dataset-server=gh:@feat/mutation-pattern-analysis@&dataset-name=nextstrain/orthoebolavirus/bdbv&input-fasta=example


Nextclade's qc.snpClusters rule detects dense private substitution clusters as a quality signal, but treats all substitution types uniformly. Biological editing enzymes - ADAR (adenosine deaminase acting on RNA) and APOBEC (apolipoprotein B mRNA editing catalytic polypeptide-like) - produce directional substitution clusters that are expected in some viral lineages and should not inflate QC penalty scores.

This PR adds a configurable mutation pattern analysis step. Dataset authors define named patterns in pathogen.json, each selecting private substitutions by reference/query nucleotide and optional reference-context regex motifs. The analysis produces per-pattern match counts, per-type breakdowns, and pattern-local clusters independent from the global QC rule. Existing datasets without mutationPatterns config are unaffected - the QC rule falls back to qc.snpClusters parameters as before.

The event model uses tagged unions ("type": "nucSubstitution") so future event types (amino acid changes, insertions) can be added without breaking the output structure. Trinucleotide context is defined against the global reference sequence rather than the nearest tree node, matching mutational signature conventions where the enzyme acts on the physical genome.
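
As a rough illustration of the context choice described above, the following sketch reads the trinucleotide context straight from the reference sequence, so it does not depend on the query or on the nearest tree node. The function name `ref_context`, the 0-based position, and the truncation at sequence ends are assumptions made for this example, not details taken from the PR.

```python
def ref_context(ref_seq: str, pos: int, flank: int = 1) -> list[str]:
    """Return reference nucleotides around 0-based `pos` (hypothetical helper).

    Assumption: the context is truncated at the sequence ends rather than padded.
    """
    start = max(0, pos - flank)
    end = min(len(ref_seq), pos + flank + 1)
    return list(ref_seq[start:end])


reference = "GGAAGTCC"
print(ref_context(reference, 3))  # ['A', 'A', 'G']
print(ref_context(reference, 0))  # ['G', 'G'] (truncated at the left end)
```

The first result has the same shape as the `refContext` arrays in the example output: preceding nucleotide, substitution site, following nucleotide.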

New configuration

Add mutationPatterns to pathogen.json:

Temporarily use this URL for $schema to get autocomplete and docs for the new objects when editing pathogen.json (revert before releasing):

https://raw.githubusercontent.com/nextstrain/nextclade/refs/heads/feat/mutation-pattern-analysis/packages/nextclade-schemas/input-pathogen-json.schema.json

"mutationPatterns": {
  "patterns": [
    {
      "id": "adar",
      "name": "ADAR-like RNA editing",
      "description": "A-to-I editing observed as A>G and T>C on complementary strand",
      "events": [
        { "type": "nucSubstitution", "ref": ["A"], "qry": ["G"] },
        { "type": "nucSubstitution", "ref": ["T"], "qry": ["C"] }
      ],
      "cluster": { "windowSize": 50, "cutoff": 3 }
    }
  ]
}

Each pattern contains:

  • id, name, description - identification and display metadata
  • events[] - filters selecting which private substitutions match this pattern. Each event specifies ref and qry nucleotide arrays (IUPAC ambiguity codes supported) and an optional motifs array of regex patterns matched against the reference sequence at the substitution site
  • cluster (optional) - pattern-local sliding window clustering with windowSize (nucleotides) and cutoff (minimum events per window). Independent from qc.snpClusters
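
The event filters listed above could be evaluated roughly as in the sketch below. This is Python for illustration only and is not the PR's implementation. The helper names, the 0-based positions, the `flank` window, and the rule that a motif must cover the substitution site are all assumptions; the PR only states that motifs are matched against the reference sequence at the substitution site. The `tc_only` filter is a hypothetical example, not a pattern taken from the PR.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def nuc_matches(codes, nuc):
    """True if `nuc` is covered by any of the IUPAC `codes`."""
    return any(nuc in IUPAC[code] for code in codes)


def motif_covers_site(motif, window, site):
    """True if some regex match in `window` spans index `site` (assumed semantics)."""
    pattern = re.compile(motif)
    for start in range(site + 1):
        m = pattern.match(window, start)  # try a match anchored at each start offset
        if m and m.end() > site:
            return True
    return False


def event_matches(event, ref_seq, pos, qry_nuc, flank=5):
    """Check one private substitution (0-based `pos`) against one event filter."""
    if event["type"] != "nucSubstitution":
        return False
    if not (nuc_matches(event["ref"], ref_seq[pos]) and nuc_matches(event["qry"], qry_nuc)):
        return False
    motifs = event.get("motifs") or []
    if not motifs:
        return True  # no motifs configured: the ref/qry filter alone decides
    lo = max(0, pos - flank)
    window = ref_seq[lo:pos + flank + 1]
    return any(motif_covers_site(m, window, pos - lo) for m in motifs)


adar = {"type": "nucSubstitution", "ref": ["A"], "qry": ["G"]}
tc_only = {"type": "nucSubstitution", "ref": ["C"], "qry": ["T"], "motifs": ["TC"]}
reference = "AATCGGACG"
print(event_matches(adar, reference, 0, "G"))     # True
print(event_matches(tc_only, reference, 3, "T"))  # True: C at index 3 is preceded by T
print(event_matches(tc_only, reference, 7, "T"))  # False: C at index 7 is preceded by A
```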

Analysis output

Per-sequence, per-pattern results in JSON/NDJSON (mutationPatterns.results[]):

  • matches[] - all private substitutions matching the pattern, with local reference context and motif match positions
  • eventTypeCounts[] - per-substitution-type breakdown (e.g., A>G: 8, T>C: 6)
  • clusters[] - dense regions where matched events exceed the cutoff within the sliding window, with per-cluster event lists and type counts
  • counts - summary: total matches, clustered events, cluster count
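
One way to picture the sliding-window clustering behind clusters[] is the two-pointer sketch below. It is an illustration under assumptions, not the PR's algorithm: the function name `find_clusters` is hypothetical, a window counts as dense when it holds at least `cutoff` events (the description says both "minimum events per window" and "exceed the cutoff", so the exact comparison is a guess), overlapping dense windows are merged, and `end` is reported as the position of the last event in the cluster.

```python
def find_clusters(positions, window_size, cutoff):
    """Group event positions into dense clusters (hypothetical sketch).

    Assumes distinct positions and treats a window as dense when it
    contains at least `cutoff` events within `window_size` nucleotides.
    """
    positions = sorted(positions)
    clusters = []  # each cluster is a list of event positions
    left = 0
    for right, pos in enumerate(positions):
        # Advance `left` until the window spans fewer than `window_size` nucleotides.
        while pos - positions[left] >= window_size:
            left += 1
        if right - left + 1 >= cutoff:
            dense = positions[left:right + 1]
            if clusters and clusters[-1][-1] >= dense[0]:
                # Overlaps the previous cluster: append only the new positions.
                clusters[-1].extend(p for p in dense if p > clusters[-1][-1])
            else:
                clusters.append(list(dense))
    return [{"start": c[0], "end": c[-1], "count": len(c)} for c in clusters]


events = [5003, 5010, 5015, 5020, 5028, 5033, 7000, 9000]
print(find_clusters(events, window_size=50, cutoff=3))
# [{'start': 5003, 'end': 5033, 'count': 6}]
```

The two isolated events at 7000 and 9000 count as matches but not as clustered events, which mirrors the distinction between counts.matches and counts.clustered.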

Example JSON output (one pattern, one cluster; only two of the cluster's six events are shown):

"mutationPatterns": {
  "results": [{
    "id": "adar",
    "name": "ADAR-like RNA editing",
    "description": "A-to-I editing observed as A>G and T>C on complementary strand",
    "counts": { "matches": 14, "clustered": 6, "clusters": 1 },
    "eventTypeCounts": [
      { "type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 8 },
      { "type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 6 }
    ],
    "clusters": [{
      "start": 5003, "end": 5033, "count": 6,
      "events": [
        { "type": "nucSubstitution", "pos": 5003, "refNuc": "A", "qryNuc": "G",
          "refContext": ["A", "A", "G"], "motifMatches": [] },
        { "type": "nucSubstitution", "pos": 5010, "refNuc": "T", "qryNuc": "C",
          "refContext": ["G", "T", "A"], "motifMatches": [] }
      ],
      "eventTypeCounts": [
        { "type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 4 },
        { "type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 2 }
      ]
    }],
    "matches": ["... (14 matched events, omitted for brevity)"]
  }]
}
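
A downstream script might consume this output along the following lines. The sketch assumes the mutationPatterns object sits at the top level of a per-sequence record, which the fragment above does not confirm, and the helper names are made up for this example. It dispatches on the "type" tag and passes over unknown tags, which is one way a consumer could stay compatible with event types added later.

```python
import json

# Trimmed copy of the example output above, wrapped in an assumed record object.
record_json = """
{
  "mutationPatterns": {
    "results": [{
      "id": "adar",
      "counts": { "matches": 14, "clustered": 6, "clusters": 1 },
      "eventTypeCounts": [
        { "type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 8 },
        { "type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 6 }
      ]
    }]
  }
}
"""


def describe_event_type(entry):
    """Render one eventTypeCounts entry; return None for tags this script does not know."""
    if entry["type"] == "nucSubstitution":
        return f'{entry["refNuc"]}>{entry["qryNuc"]}: {entry["count"]}'
    return None


def summarize(record):
    """Collect per-pattern match counts and readable type breakdowns."""
    summary = {}
    for res in record.get("mutationPatterns", {}).get("results", []):
        labels = [d for d in map(describe_event_type, res["eventTypeCounts"]) if d is not None]
        summary[res["id"]] = {
            "matches": res["counts"]["matches"],
            "clustered": res["counts"]["clustered"],
            "types": labels,
        }
    return summary


summary = summarize(json.loads(record_json))
print(summary)
# {'adar': {'matches': 14, 'clustered': 6, 'types': ['A>G: 8', 'T>C: 6']}}
```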

TSV/CSV uses per-pattern dynamic columns following the founderMuts['name'].field convention. Each configured pattern produces 4 columns with summary data suitable for spreadsheets and simple scripts. Exhaustive per-event and per-cluster detail is available in JSON/NDJSON output.

Example with one pattern (adar):

mutationPatterns['adar'].counts.matches     14
mutationPatterns['adar'].counts.clustered   6
mutationPatterns['adar'].counts.clusters    1
mutationPatterns['adar'].eventTypeCounts    nucSubstitution:A>G:8,nucSubstitution:T>C:6

With two patterns (adar + apobec), 8 columns appear (4 per pattern):

mutationPatterns['adar'].counts.matches       14
mutationPatterns['adar'].counts.clustered     6
mutationPatterns['adar'].counts.clusters      1
mutationPatterns['adar'].eventTypeCounts      nucSubstitution:A>G:8,nucSubstitution:T>C:6
mutationPatterns['apobec'].counts.matches     8
mutationPatterns['apobec'].counts.clustered   3
mutationPatterns['apobec'].counts.clusters    1
mutationPatterns['apobec'].eventTypeCounts    nucSubstitution:C>T:8

The mutation patterns column group has an independent toggle (includeMutPatterns) in the web export column config UI.
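
The per-pattern column naming shown above can be reproduced with a few lines, for example to build the expected header set in a script that reads the TSV. This is a sketch with hypothetical helper names, not the mut_pattern_cols() helper from the PR, and it does not handle quoting of pattern IDs that contain special characters.

```python
# Field suffixes as they appear in the example columns above.
FIELDS = ["counts.matches", "counts.clustered", "counts.clusters", "eventTypeCounts"]


def mut_pattern_columns(pattern_ids):
    """Return column names in pattern order, four per pattern (hypothetical helper)."""
    return [f"mutationPatterns['{pid}'].{field}" for pid in pattern_ids for field in FIELDS]


def format_event_type_counts(entries):
    """Render eventTypeCounts entries in the type:ref>qry:count cell format shown above."""
    return ",".join(f"{e['type']}:{e['refNuc']}>{e['qryNuc']}:{e['count']}" for e in entries)


columns = mut_pattern_columns(["adar", "apobec"])
print(len(columns))  # 8
print(columns[0])    # mutationPatterns['adar'].counts.matches

cell = format_event_type_counts([
    {"type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 8},
    {"type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 6},
])
print(cell)  # nucSubstitution:A>G:8,nucSubstitution:T>C:6
```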

Web visualization

  • QC column tooltip lists clusters (TODO: QC tooltip text needs to be structured, not plain text)
  • Mutation column tooltip shows pattern-grouped clusters with nucleotide mutation badges
  • Orange-bordered cluster markers in the sequence view (configurable height: Off/Top/Bottom/Full)

Work items

  • Add mutation pattern config types (MutationPatternsConfig, MutationPatternEvent::NucSubstitution) and analysis engine (analyze_mutation_patterns()) with event matching, regex motif compilation, sliding-window clustering, and type counting
  • Add NucSubWithContext pairing each substitution with its local reference genome context
  • Refactor rule_snp_clusters() to consume pre-computed &[MutationPatternCluster] instead of recomputing clusters internally
  • Wire analysis into the pipeline between private mutation calling and QC scoring; add mutationPatterns to JSON/NDJSON output and CSV/TSV columns for pattern metadata, match counts, and cluster details
  • Add SequenceMarkerCluster component and MutationPatternsSection tooltip with pattern-grouped cluster display
  • Document mutationPatterns config in pathogen.json docs and new column definitions in TSV/CSV docs; regenerate JSON schemas
  • Add unit tests covering context extraction, type counting, cluster detection, type/motif filtering, IUPAC matching, legacy config fallback, config precedence, QC scoring, and output shape
  • Redesign TSV output to use per-pattern dynamic columns (mutationPatterns['id'].field) following the founderMuts convention, with includeMutPatterns toggle in web UI

- Tagged union event model (MutationPatternEvent::NucSubstitution) allows adding event types without changing output structure
- Regex motifs matched against reference sequence replace fixed trinucleotide IUPAC context, supporting arbitrary-length patterns
- Context uses global reference genome, matching mutational signature conventions where the enzyme acts on the physical genome
- QC scoring uses pre-computed unfiltered clusters, independent of pattern-specific type filtering
- window_size and cluster_cut_off on QcRulesConfigSnpClusters become serde-defaulted for backward compatibility with datasets omitting them
- Wire analyze_mutation_patterns() between private mutation calling and QC scoring
- Add mutationPatterns to JSON/NDJSON output and 9 CSV/TSV columns with pattern-delimited multi-pattern support
- Fix inclusive-to-half-open range conversion in cluster range formatters
- Cluster markers rendered as orange-bordered SVG rectangles with configurable height (Off/Top/Bottom/Full)
- Mutation column tooltip groups events by configured pattern, showing cluster ranges and nucleotide badges
- Add mutationPatterns section to pathogen.json docs with ADAR and APOBEC examples
- Add mutationPatterns.* column definitions to TSV/CSV output docs
- Regenerate JSON/YAML schemas with schemars example annotations on all mutation pattern types
@github-actions

…utput

- Replace pipe-delimited multi-pattern cells with per-entity columns following the founderMuts['name'].field convention
- Column names: mutationPatterns['id'].counts.{matches,clustered,clusters}, mutationPatterns['id'].eventTypeCounts
- Add include_mut_patterns flag to CsvColumnConfig for independent toggling
- Remove MutPatterns from CSV_COLUMN_CONFIG_MAP_DEFAULT (columns now dynamic)
- Remove dead code: format_mutation_patterns_field, format_mutation_clusters_ranges, format_mutation_clusters_events, format_mutation_pattern_event, PATTERN_DELIMITER
- Add mutation_pattern_keys to AnalysisInitialData
- Propagate mutation_pattern_keys through CSV/XLSX/WASM call chains
- Add mutation patterns checkbox to web export column config UI
- Update TypeScript types for CsvColumnConfig and AnalysisInitialData
- Test mut_pattern_cols() helper with single and special-character pattern IDs
- Test prepare_headers() with 0, 1, 2 patterns and with patterns disabled
- Verify columns not emitted when include_mut_patterns is false
- Verify column ordering follows pattern key order from config
@jameshadfield
Member

Richard brainstormed this idea last week and here it is - awesome work @ivan-aksamentov

@rneher
Member

rneher commented May 28, 2026

Very cool! I think this will be pretty useful. My only thought at the moment is that we typically evaluate these SNP clusters on private mutations. But for the sequence context your implementation picks the reference context (I think). This would be kind of hard to change, I imagine, since the local context is only known as a list of differences to the reference. For viruses for which we currently think this would be useful, this isn't an issue though.
