feat: add mutation pattern analysis #1767
Open
ivan-aksamentov wants to merge 9 commits into
- Tagged union event model (MutationPatternEvent::NucSubstitution) allows adding event types without changing output structure
- Regex motifs matched against reference sequence replace fixed trinucleotide IUPAC context, supporting arbitrary-length patterns
- Context uses global reference genome, matching mutational signature conventions where the enzyme acts on the physical genome

- QC scoring uses pre-computed unfiltered clusters, independent of pattern-specific type filtering
- window_size and cluster_cut_off on QcRulesConfigSnpClusters become serde-defaulted for backward compatibility with datasets omitting them

- Wire analyze_mutation_patterns() between private mutation calling and QC scoring
- Add mutationPatterns to JSON/NDJSON output and 9 CSV/TSV columns with pattern-delimited multi-pattern support
- Fix inclusive-to-half-open range conversion in cluster range formatters

- Cluster markers rendered as orange-bordered SVG rectangles with configurable height (Off/Top/Bottom/Full)
- Mutation column tooltip groups events by configured pattern, showing cluster ranges and nucleotide badges

- Add mutationPatterns section to pathogen.json docs with ADAR and APOBEC examples
- Add mutationPatterns.* column definitions to TSV/CSV output docs
- Regenerate JSON/YAML schemas with schemars example annotations on all mutation pattern types
3 tasks
…utput
- Replace pipe-delimited multi-pattern cells with per-entity columns following the founderMuts['name'].field convention
- Column names: mutationPatterns['id'].counts.{matches,clustered,clusters}, mutationPatterns['id'].eventTypeCounts
- Add include_mut_patterns flag to CsvColumnConfig for independent toggling
- Remove MutPatterns from CSV_COLUMN_CONFIG_MAP_DEFAULT (columns now dynamic)
- Remove dead code: format_mutation_patterns_field, format_mutation_clusters_ranges, format_mutation_clusters_events, format_mutation_pattern_event, PATTERN_DELIMITER
- Add mutation_pattern_keys to AnalysisInitialData
- Propagate mutation_pattern_keys through CSV/XLSX/WASM call chains
- Add mutation patterns checkbox to web export column config UI
- Update TypeScript types for CsvColumnConfig and AnalysisInitialData
- Test mut_pattern_cols() helper with single and special-character pattern IDs
- Test prepare_headers() with 0, 1, 2 patterns and with patterns disabled
- Verify columns not emitted when include_mut_patterns is false
- Verify column ordering follows pattern key order from config
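The per-pattern column naming described in this commit can be sketched as follows. This is only an illustration under stated assumptions: the name `mut_pattern_cols` appears in the commit message, but its signature and body here are guesses, `mut_pattern_headers` is a hypothetical stand-in for the pattern-related part of header preparation, and no escaping of special characters in pattern IDs is attempted.

```rust
/// Builds the four summary column names for one pattern ID.
/// The function name comes from the commit message; the signature and body
/// are assumptions made for illustration.
fn mut_pattern_cols(id: &str) -> [String; 4] {
    [
        format!("mutationPatterns['{}'].counts.matches", id),
        format!("mutationPatterns['{}'].counts.clustered", id),
        format!("mutationPatterns['{}'].counts.clusters", id),
        format!("mutationPatterns['{}'].eventTypeCounts", id),
    ]
}

/// Hypothetical helper: emits columns for every configured pattern, in the
/// order the pattern keys are given, or nothing when the group is disabled.
fn mut_pattern_headers(pattern_ids: &[&str], include_mut_patterns: bool) -> Vec<String> {
    if !include_mut_patterns {
        return Vec::new();
    }
    pattern_ids
        .iter()
        .flat_map(|id| mut_pattern_cols(id))
        .collect()
}
```

With pattern keys `adar` and `apobec`, this sketch yields 8 headers, the four `adar` columns first, and an empty list when `include_mut_patterns` is false.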
Richard brainstormed this idea last week and here it is - awesome work @ivan-aksamentov ⭐
Very cool! I think this will be pretty useful. My only thought at the moment is that we typically evaluate these SNP clusters on private mutations, but for the sequence context your implementation picks the reference context (I think). I imagine this would be kind of hard to change, since the local context is only known as a list of differences to the reference. For the viruses for which we currently think this would be useful, this isn't an issue though.
Test web: https://nextstrain--nextclade--pr-1767.previews.neherlab.click/?dataset-server=gh:@feat/mutation-pattern-analysis@&dataset-name=nextstrain/orthoebolavirus/bdbv&input-fasta=example
Nextclade's `qc.snpClusters` rule detects dense private substitution clusters as a quality signal, but treats all substitution types uniformly. Biological editing enzymes - ADAR (adenosine deaminase acting on RNA) and APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) - produce directional substitution clusters that are expected in some viral lineages and should not inflate QC penalty scores.

This PR adds a configurable mutation pattern analysis step. Dataset authors define named patterns in `pathogen.json`, each selecting private substitutions by reference/query nucleotide and optional reference-context regex motifs. The analysis produces per-pattern match counts, per-type breakdowns, and pattern-local clusters independent from the global QC rule. Existing datasets without `mutationPatterns` config are unaffected - the QC rule falls back to `qc.snpClusters` parameters as before.

The event model uses tagged unions (`"type": "nucSubstitution"`) so future event types (amino acid changes, insertions) can be added without breaking the output structure. Trinucleotide context is defined against the global reference sequence rather than the nearest tree node, matching mutational signature conventions where the enzyme acts on the physical genome.

New configuration
Add `mutationPatterns` to `pathogen.json`:

Each pattern contains:

- `id`, `name`, `description` - identification and display metadata
- `events[]` - filters selecting which private substitutions match this pattern. Each event specifies `ref` and `qry` nucleotide arrays (IUPAC ambiguity codes supported) and an optional `motifs` array of regex patterns matched against the reference sequence at the substitution site
- `cluster` (optional) - pattern-local sliding window clustering with `windowSize` (nucleotides) and `cutoff` (minimum events per window). Independent from `qc.snpClusters`

Analysis output
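To illustrate how an `events[]` filter with IUPAC-coded `ref` and `qry` arrays could select the substitutions that end up in the results, here is a minimal sketch. All type and function names (`NucSubFilter`, `NucSub`, `event_matches`) are hypothetical and not taken from the PR; regex `motifs` are left out because they would need an external regex crate, and RNA `U` is not handled.

```rust
/// Expands an IUPAC nucleotide code into the concrete bases it stands for.
/// Unknown characters expand to nothing and therefore never match.
fn iupac_bases(code: char) -> &'static str {
    match code.to_ascii_uppercase() {
        'A' => "A",
        'C' => "C",
        'G' => "G",
        'T' => "T",
        'R' => "AG",
        'Y' => "CT",
        'S' => "CG",
        'W' => "AT",
        'K' => "GT",
        'M' => "AC",
        'B' => "CGT",
        'D' => "AGT",
        'H' => "ACT",
        'V' => "ACG",
        'N' => "ACGT",
        _ => "",
    }
}

/// Hypothetical event filter: allowed reference and query codes.
struct NucSubFilter {
    ref_codes: Vec<char>,
    qry_codes: Vec<char>,
}

/// Hypothetical private substitution (position omitted for brevity).
struct NucSub {
    ref_nuc: char,
    qry_nuc: char,
}

/// True if any code in the list covers the given concrete nucleotide.
fn code_list_matches(codes: &[char], nuc: char) -> bool {
    let nuc = nuc.to_ascii_uppercase();
    codes.iter().any(|&code| iupac_bases(code).contains(nuc))
}

/// A substitution matches when both its reference and query nucleotides
/// are covered by the filter's code lists.
fn event_matches(filter: &NucSubFilter, sub: &NucSub) -> bool {
    code_list_matches(&filter.ref_codes, sub.ref_nuc)
        && code_list_matches(&filter.qry_codes, sub.qry_nuc)
}
```

In this sketch, an ADAR-like filter with `ref_codes: ['A']` and `qry_codes: ['G']` accepts A>G and rejects C>T, while a filter using `N` and `R` accepts any substitution whose query base is a purine.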
Per-sequence, per-pattern results in JSON/NDJSON (`mutationPatterns.results[]`):

- `matches[]` - all private substitutions matching the pattern, with local reference context and motif match positions
- `eventTypeCounts[]` - per-substitution-type breakdown (e.g., A>G: 8, T>C: 6)
- `clusters[]` - dense regions where matched events exceed the cutoff within the sliding window, with per-cluster event lists and type counts
- `counts` - summary: total matches, clustered events, cluster count

Example JSON output (one pattern, one cluster with two events):
TSV/CSV uses per-pattern dynamic columns following the `founderMuts['name'].field` convention. Each configured pattern produces 4 columns with summary data suitable for spreadsheets and simple scripts. Exhaustive per-event and per-cluster detail is available in JSON/NDJSON output.

Example with one pattern (`adar`):

With two patterns (`adar` + `apobec`), 8 columns appear (4 per pattern):

The mutation patterns column group has an independent toggle (`includeMutPatterns`) in the web export column config UI.

Web visualization
Work items
- … `MutationPatternsConfig`, `MutationPatternEvent::NucSubstitution`) and analysis engine (`analyze_mutation_patterns()`) with event matching, regex motif compilation, sliding-window clustering, and type counting
- … `NucSubWithContext` pairing each substitution with its local reference genome context
- … `rule_snp_clusters()` to consume pre-computed `&[MutationPatternCluster]` instead of recomputing clusters internally
- … `mutationPatterns` to JSON/NDJSON output and CSV/TSV columns for pattern metadata, match counts, and cluster details
- … `SequenceMarkerCluster` component and `MutationPatternsSection` tooltip with pattern-grouped cluster display
- … `mutationPatterns` config in pathogen.json docs and new column definitions in TSV/CSV docs; regenerate JSON schemas
- … `mutationPatterns['id'].field`) following the `founderMuts` convention, with `includeMutPatterns` toggle in web UI
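The sliding-window clustering named in the work items can be sketched roughly as below. This is not the code from `analyze_mutation_patterns()`: the `Cluster` type and `find_clusters` function are hypothetical, and the sketch assumes that a window is dense when it holds at least `cutoff` events, that a window spans fewer than `window_size` nucleotides, that overlapping dense windows merge into one cluster, and that cluster ranges are reported half-open (`end` is one past the last clustered position).

```rust
/// Hypothetical cluster: half-open range [begin, end) plus the matched
/// event positions that fall inside it.
#[derive(Debug, PartialEq)]
struct Cluster {
    begin: usize,
    end: usize,
    events: Vec<usize>,
}

/// Finds dense regions among matched event positions using a two-pointer
/// sliding window. Assumption: "dense" means count >= cutoff.
fn find_clusters(positions: &[usize], window_size: usize, cutoff: usize) -> Vec<Cluster> {
    if window_size == 0 || cutoff == 0 {
        return Vec::new();
    }
    let mut sorted = positions.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut clusters: Vec<Cluster> = Vec::new();
    let mut left = 0;
    for right in 0..sorted.len() {
        // Shrink the window until it spans fewer than `window_size` nucleotides.
        while sorted[right] - sorted[left] >= window_size {
            left += 1;
        }
        if right - left + 1 < cutoff {
            continue;
        }
        // Does this dense window share events with the previous cluster?
        let overlaps = clusters.last().map_or(false, |c| sorted[left] < c.end);
        if overlaps {
            let last = clusters.last_mut().unwrap();
            // Append only events beyond the current cluster end.
            for &pos in &sorted[left..=right] {
                if pos >= last.end {
                    last.events.push(pos);
                }
            }
            // Inclusive last position converted to a half-open end.
            last.end = sorted[right] + 1;
        } else {
            clusters.push(Cluster {
                begin: sorted[left],
                end: sorted[right] + 1,
                events: sorted[left..=right].to_vec(),
            });
        }
    }
    clusters
}
```

For matched positions 10, 12, 15, 300, 500, 505, 509, 512 with a window of 20 and a cutoff of 3, this sketch reports two clusters, [10, 16) and [500, 513), so the summary would be 8 matches, 7 clustered events, and 2 clusters; the isolated event at 300 stays unclustered.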