feat: add mutation pattern analysis by ivan-aksamentov · Pull Request #1767 · nextstrain/nextclade

ivan-aksamentov · 2026-05-26T10:13:13Z

Related to: feat: add example ebolavirus mutation pattern configs nextclade_data#456 (placeholder ebolavirus dataset configs for testing - event filters and clustering parameters not yet validated)

Test web: https://nextstrain--nextclade--pr-1767.previews.neherlab.click/?dataset-server=gh:@feat/mutation-pattern-analysis@&dataset-name=nextstrain/orthoebolavirus/bdbv&input-fasta=example

Nextclade's qc.snpClusters rule detects dense private substitution clusters as a quality signal, but treats all substitution types uniformly. Biological editing enzymes - ADAR (adenosine deaminase acting on RNA) and APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) - produce directional substitution clusters that are expected in some viral lineages and should not inflate QC penalty scores.

This PR adds a configurable mutation pattern analysis step. Dataset authors define named patterns in pathogen.json, each selecting private substitutions by reference/query nucleotide and optional reference-context regex motifs. The analysis produces per-pattern match counts, per-type breakdowns, and pattern-local clusters independent from the global QC rule. Existing datasets without mutationPatterns config are unaffected - the QC rule falls back to qc.snpClusters parameters as before.

The event model uses tagged unions ("type": "nucSubstitution") so future event types (amino acid changes, insertions) can be added without breaking the output structure. Trinucleotide context is defined against the global reference sequence rather than the nearest tree node, matching mutational signature conventions where the enzyme acts on the physical genome.
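A minimal Python sketch of the reference-context idea described above. It assumes 0-based positions and one flanking nucleotide on each side, which matches the three-element refContext arrays in the example output. The function name `ref_context` and the clipping at sequence ends are illustrative assumptions, not the PR's Rust implementation.

```python
def ref_context(ref_seq: str, pos: int, flank: int = 1) -> list[str]:
    """Return the reference nucleotides around `pos` (0-based).

    The context is always read from the global reference sequence,
    never from the query or from a tree node. Near the sequence ends
    the window is clipped (an assumption made for this sketch).
    """
    start = max(0, pos - flank)
    end = min(len(ref_seq), pos + flank + 1)
    return list(ref_seq[start:end])


reference = "GGAAGTCC"
# Site 3 is 'A', flanked by 'A' on the left and 'G' on the right.
context = ref_context(reference, 3)
```

Because the context depends only on the reference and the position, it can be computed once per site and reused for every query sequence.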

New configuration

Add mutationPatterns to pathogen.json:

Temporarily use this URL for $schema to get autocomplete and docs for the new objects when editing pathogen.json (revert before releasing):
https://raw.githubusercontent.com/nextstrain/nextclade/refs/heads/feat/mutation-pattern-analysis/packages/nextclade-schemas/input-pathogen-json.schema.json

"mutationPatterns": {
  "patterns": [
    {
      "id": "adar",
      "name": "ADAR-like RNA editing",
      "description": "A-to-I editing observed as A>G and T>C on complementary strand",
      "events": [
        { "type": "nucSubstitution", "ref": ["A"], "qry": ["G"] },
        { "type": "nucSubstitution", "ref": ["T"], "qry": ["C"] }
      ],
      "cluster": { "windowSize": 50, "cutoff": 3 }
    }
  ]
}

Each pattern contains:

id, name, description - identification and display metadata
events[] - filters selecting which private substitutions match this pattern. Each event specifies ref and qry nucleotide arrays (IUPAC ambiguity codes supported) and an optional motifs array of regex patterns matched against the reference sequence at the substitution site
cluster (optional) - pattern-local sliding window clustering with windowSize (nucleotides) and cutoff (minimum events per window). Independent from qc.snpClusters
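The event filter described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's Rust code: the helper names (`nuc_matches`, `motif_covers`, `event_matches`) are hypothetical, positions are 0-based, a motif is taken to match "at the site" when some regex match on the reference spans the substitution position, and several motifs are combined with OR.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def nuc_matches(codes: list[str], nuc: str) -> bool:
    """True if `nuc` is covered by any of the (possibly ambiguous) codes."""
    return any(nuc in IUPAC[code] for code in codes)


def motif_covers(ref_seq: str, pos: int, motifs: list[str]) -> bool:
    """True if any motif has a match on the reference that spans `pos`."""
    for motif in motifs:
        pattern = re.compile(motif)
        # Try every start position up to the site, so overlapping matches count.
        for start in range(pos + 1):
            m = pattern.match(ref_seq, start)
            if m and m.end() > pos:
                return True
    return False


def event_matches(event: dict, ref_seq: str, pos: int, ref_nuc: str, qry_nuc: str) -> bool:
    """Check one private substitution against one event filter."""
    if event["type"] != "nucSubstitution":
        return False  # other event types are not handled in this sketch
    if not nuc_matches(event["ref"], ref_nuc) or not nuc_matches(event["qry"], qry_nuc):
        return False
    motifs = event.get("motifs", [])
    return not motifs or motif_covers(ref_seq, pos, motifs)


adar_event = {"type": "nucSubstitution", "ref": ["A"], "qry": ["G"]}
tc_event = {"type": "nucSubstitution", "ref": ["C"], "qry": ["T"], "motifs": ["TC"]}
```

With the reference `AATCGGCC`, `tc_event` accepts a C>T substitution at position 3 (preceded by T) and rejects one at position 6 (preceded by G).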

Analysis output

Per-sequence, per-pattern results in JSON/NDJSON (mutationPatterns.results[]):

matches[] - all private substitutions matching the pattern, with local reference context and motif match positions
eventTypeCounts[] - per-substitution-type breakdown (e.g., A>G: 8, T>C: 6)
clusters[] - dense regions where matched events exceed the cutoff within the sliding window, with per-cluster event lists and type counts
counts - summary: total matches, clustered events, cluster count
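A rough Python sketch of pattern-local sliding-window clustering, to make the windowSize and cutoff parameters concrete. It assumes that a window qualifies when it holds at least `cutoff` matched events (the text says both "minimum events per window" and "exceed the cutoff", so the exact comparison is an assumption), that positions are unique, and that overlapping qualifying windows are merged into one cluster. The function name `find_clusters` is hypothetical.

```python
def find_clusters(positions: list[int], window_size: int, cutoff: int) -> list[dict]:
    """Group matched event positions into dense clusters.

    A window spans fewer than `window_size` nucleotides. Every window that
    contains at least `cutoff` events contributes its events to a cluster;
    windows that share events are merged.
    """
    positions = sorted(positions)
    clusters: list[list[int]] = []
    left = 0
    for right, pos in enumerate(positions):
        # Advance the left edge until the window fits within window_size.
        while pos - positions[left] >= window_size:
            left += 1
        if right - left + 1 >= cutoff:
            window = positions[left:right + 1]
            if clusters and clusters[-1][-1] >= window[0]:
                # Shares events with the previous cluster: extend it.
                clusters[-1].extend(p for p in window if p > clusters[-1][-1])
            else:
                clusters.append(list(window))
    return [{"start": c[0], "end": c[-1], "count": len(c)} for c in clusters]


clusters = find_clusters([10, 20, 30, 500, 5003, 5010, 5015], window_size=50, cutoff=3)
```

Here the isolated event at position 500 belongs to no cluster, while the two dense groups each form one cluster of three events.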

Example JSON output (one pattern, one cluster; only two of the cluster's six events are shown):

"mutationPatterns": {
  "results": [{
    "id": "adar",
    "name": "ADAR-like RNA editing",
    "description": "A-to-I editing observed as A>G and T>C on complementary strand",
    "counts": { "matches": 14, "clustered": 6, "clusters": 1 },
    "eventTypeCounts": [
      { "type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 8 },
      { "type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 6 }
    ],
    "clusters": [{
      "start": 5003, "end": 5033, "count": 6,
      "events": [
        { "type": "nucSubstitution", "pos": 5003, "refNuc": "A", "qryNuc": "G",
          "refContext": ["A", "A", "G"], "motifMatches": [] },
        { "type": "nucSubstitution", "pos": 5010, "refNuc": "T", "qryNuc": "C",
          "refContext": ["G", "T", "A"], "motifMatches": [] }
      ],
      "eventTypeCounts": [
        { "type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 4 },
        { "type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 2 }
      ]
    }],
    "matches": ["... (14 matched events, omitted for brevity)"]
  }]
}
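As an illustration of consuming this output, the following Python sketch reads one per-sequence record and reports, for each pattern, the fraction of matched events that fall inside clusters. The record below is trimmed to the fields shown in the example above; the function name `clustered_fraction` is hypothetical, and the placement of mutationPatterns within a full Nextclade result record is assumed to be as in the example.

```python
import json

# One per-sequence record, reduced to the fields used in this sketch.
record = json.loads("""
{
  "mutationPatterns": {
    "results": [
      {
        "id": "adar",
        "counts": { "matches": 14, "clustered": 6, "clusters": 1 }
      }
    ]
  }
}
""")


def clustered_fraction(rec: dict) -> dict[str, float]:
    """Map each pattern id to clustered / matches (0.0 when nothing matched)."""
    fractions = {}
    for result in rec.get("mutationPatterns", {}).get("results", []):
        counts = result["counts"]
        matches = counts["matches"]
        fractions[result["id"]] = counts["clustered"] / matches if matches else 0.0
    return fractions


fractions = clustered_fraction(record)
```

For NDJSON output, the same function could be applied to each line after `json.loads`.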

TSV/CSV uses per-pattern dynamic columns following the founderMuts['name'].field convention. Each configured pattern produces 4 columns with summary data suitable for spreadsheets and simple scripts. Exhaustive per-event and per-cluster detail is available in JSON/NDJSON output.

Example with one pattern (adar):

mutationPatterns['adar'].counts.matches     14
mutationPatterns['adar'].counts.clustered   6
mutationPatterns['adar'].counts.clusters    1
mutationPatterns['adar'].eventTypeCounts    nucSubstitution:A>G:8,nucSubstitution:T>C:6

With two patterns (adar + apobec), 8 columns appear (4 per pattern):

mutationPatterns['adar'].counts.matches       14
mutationPatterns['adar'].counts.clustered     6
mutationPatterns['adar'].counts.clusters      1
mutationPatterns['adar'].eventTypeCounts      nucSubstitution:A>G:8,nucSubstitution:T>C:6
mutationPatterns['apobec'].counts.matches     8
mutationPatterns['apobec'].counts.clustered   3
mutationPatterns['apobec'].counts.clusters    1
mutationPatterns['apobec'].eventTypeCounts    nucSubstitution:C>T:8
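The column naming and the eventTypeCounts cell format shown above can be reproduced with a short Python sketch. The function names (`pattern_columns`, `format_event_type_counts`) are hypothetical and only mirror the examples in this description; the actual implementation lives in the Rust CSV writer.

```python
# The four per-pattern fields, in the order shown in the examples above.
FIELDS = ["counts.matches", "counts.clustered", "counts.clusters", "eventTypeCounts"]


def pattern_columns(pattern_ids: list[str]) -> list[str]:
    """Build dynamic column names, four per configured pattern, in config order."""
    return [f"mutationPatterns['{pid}'].{field}" for pid in pattern_ids for field in FIELDS]


def format_event_type_counts(entries: list[dict]) -> str:
    """Serialize per-type counts as 'type:REF>QRY:count', comma-separated."""
    return ",".join(
        f"{e['type']}:{e['refNuc']}>{e['qryNuc']}:{e['count']}" for e in entries
    )


columns = pattern_columns(["adar", "apobec"])
cell = format_event_type_counts([
    {"type": "nucSubstitution", "refNuc": "A", "qryNuc": "G", "count": 8},
    {"type": "nucSubstitution", "refNuc": "T", "qryNuc": "C", "count": 6},
])
```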

The mutation patterns column group has an independent toggle (includeMutPatterns) in the web export column config UI.

Web visualization

QC column tooltip lists clusters (TODO: QC tooltip text needs to be structured, not plain text)
Mutation column tooltip shows pattern-grouped clusters with nucleotide mutation badges
Orange-bordered cluster markers in the sequence view (configurable height: Off/Top/Bottom/Full)

Work items

Add mutation pattern config types (MutationPatternsConfig, MutationPatternEvent::NucSubstitution) and analysis engine (analyze_mutation_patterns()) with event matching, regex motif compilation, sliding-window clustering, and type counting
Add NucSubWithContext pairing each substitution with its local reference genome context
Refactor rule_snp_clusters() to consume pre-computed &[MutationPatternCluster] instead of recomputing clusters internally
Wire analysis into the pipeline between private mutation calling and QC scoring; add mutationPatterns to JSON/NDJSON output and CSV/TSV columns for pattern metadata, match counts, and cluster details
Add SequenceMarkerCluster component and MutationPatternsSection tooltip with pattern-grouped cluster display
Document mutationPatterns config in pathogen.json docs and new column definitions in TSV/CSV docs; regenerate JSON schemas
Add unit tests covering context extraction, type counting, cluster detection, type/motif filtering, IUPAC matching, legacy config fallback, config precedence, QC scoring, and output shape
Redesign TSV output to use per-pattern dynamic columns (mutationPatterns['id'].field) following the founderMuts convention, with includeMutPatterns toggle in web UI

- Tagged union event model (MutationPatternEvent::NucSubstitution) allows adding event types without changing output structure
- Regex motifs matched against reference sequence replace fixed trinucleotide IUPAC context, supporting arbitrary-length patterns
- Context uses global reference genome, matching mutational signature conventions where the enzyme acts on the physical genome

- QC scoring uses pre-computed unfiltered clusters, independent of pattern-specific type filtering
- window_size and cluster_cut_off on QcRulesConfigSnpClusters become serde-defaulted for backward compatibility with datasets omitting them

- Wire analyze_mutation_patterns() between private mutation calling and QC scoring
- Add mutationPatterns to JSON/NDJSON output and 9 CSV/TSV columns with pattern-delimited multi-pattern support
- Fix inclusive-to-half-open range conversion in cluster range formatters

- Cluster markers rendered as orange-bordered SVG rectangles with configurable height (Off/Top/Bottom/Full)
- Mutation column tooltip groups events by configured pattern, showing cluster ranges and nucleotide badges

- Add mutationPatterns section to pathogen.json docs with ADAR and APOBEC examples
- Add mutationPatterns.* column definitions to TSV/CSV output docs
- Regenerate JSON/YAML schemas with schemars example annotations on all mutation pattern types

github-actions · 2026-05-26T10:34:40Z

Preview: https://nextstrain--nextclade--pr-1767.previews.neherlab.click

(ci)

…utput

- Replace pipe-delimited multi-pattern cells with per-entity columns following the founderMuts['name'].field convention
- Column names: mutationPatterns['id'].counts.{matches,clustered,clusters}, mutationPatterns['id'].eventTypeCounts
- Add include_mut_patterns flag to CsvColumnConfig for independent toggling
- Remove MutPatterns from CSV_COLUMN_CONFIG_MAP_DEFAULT (columns now dynamic)
- Remove dead code: format_mutation_patterns_field, format_mutation_clusters_ranges, format_mutation_clusters_events, format_mutation_pattern_event, PATTERN_DELIMITER
- Add mutation_pattern_keys to AnalysisInitialData
- Propagate mutation_pattern_keys through CSV/XLSX/WASM call chains
- Add mutation patterns checkbox to web export column config UI
- Update TypeScript types for CsvColumnConfig and AnalysisInitialData

- Test mut_pattern_cols() helper with single and special-character pattern IDs
- Test prepare_headers() with 0, 1, 2 patterns and with patterns disabled
- Verify columns not emitted when include_mut_patterns is false
- Verify column ordering follows pattern key order from config

jameshadfield · 2026-05-26T21:36:20Z

Richard brainstormed this idea last week and here it is - awesome work @ivan-aksamentov ⭐

rneher · 2026-05-28T17:10:25Z

Very cool! I think this will be pretty useful. My only thought at the moment is that we typically evaluate these SNP clusters on private mutations. But for the sequence context, your implementation picks the reference context (I think). This would be kind of hard to change, I imagine, since the local context is only known as a list of differences to the reference. For viruses for which we currently think this would be useful, this isn't an issue though.

ivan-aksamentov added 5 commits May 26, 2026 12:01

feat(web): add mutation cluster markers and pattern tooltips

85d2ff7

- Cluster markers rendered as orange-bordered SVG rectangles with configurable height (Off/Top/Bottom/Full)
- Mutation column tooltip groups events by configured pattern, showing cluster ranges and nucleotide badges

ivan-aksamentov mentioned this pull request May 26, 2026

feat: add example ebolavirus mutation pattern configs nextstrain/nextclade_data#456

Draft

3 tasks

ivan-aksamentov added 4 commits May 26, 2026 13:22

refactor: lint

6d60933

fix: remove generated _SchemaRoot.d.ts from tracking

e2391ca

ivan-aksamentov wants to merge 9 commits into master from feat/mutation-pattern-analysis
