Background
MaveDB currently stores ClinVar associations in a generic clinical_controls table, originally designed to hold associations from multiple external databases. In practice it only holds ClinVar data, and the original generalized design has become a liability: ClinVar's own identifier model requires two distinct fields — Allele ID and Variation ID — which don't fit cleanly into a generic structure.
Additionally, ClinVar has been converging on Variation ID as the canonical public identifier (anchoring their web UI, VCF files, and search), while Allele ID remains the correct handle for allele-level cross-references like gnomAD. MaveDB currently only stores Allele ID, meaning our external ClinVar links are using a secondary identifier.
Proposed Change
Replace clinical_controls with a dedicated clinvar_variants table with explicit fields for both identifiers:
clinvar_allele_id — for allele-level cross-references (gnomAD, etc.)
clinvar_variation_id — for external ClinVar links (clinvar/variation/{id})
Both fields can be populated from the ClinVar TSV, which contains both IDs, so no additional lookups are required. For the simple SNVs and indels MaveDB works with, the two identifiers are effectively 1-to-1.
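As a rough illustration of populating both fields in one pass, the sketch below reads a tab-separated file and yields one record per row. The column names `#AlleleID` and `VariationID` are assumptions about the TSV header and should be checked against the file actually ingested; `ClinvarVariant`, `read_clinvar_tsv`, and the full NCBI hostname in the link template are hypothetical names chosen for this example, not existing MaveDB code.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO

# Assumed public link template; the proposal only specifies the
# clinvar/variation/{id} path.
CLINVAR_VARIATION_URL = "https://www.ncbi.nlm.nih.gov/clinvar/variation/{id}/"


@dataclass(frozen=True)
class ClinvarVariant:
    """Hypothetical in-memory record mirroring the two proposed fields."""

    clinvar_allele_id: int
    clinvar_variation_id: int

    @property
    def clinvar_url(self) -> str:
        # External links are built from the Variation ID, not the Allele ID.
        return CLINVAR_VARIATION_URL.format(id=self.clinvar_variation_id)


def read_clinvar_tsv(
    handle: TextIO,
    allele_col: str = "#AlleleID",       # assumed header name
    variation_col: str = "VariationID",  # assumed header name
) -> Iterator[ClinvarVariant]:
    """Yield one record per TSV row, taking both IDs from the same row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield ClinvarVariant(
            clinvar_allele_id=int(row[allele_col]),
            clinvar_variation_id=int(row[variation_col]),
        )


# Made-up IDs, used only to exercise the parser.
sample = "#AlleleID\tName\tVariationID\n111\texample variant\t222\n"
records = list(read_clinvar_tsv(io.StringIO(sample)))
```

Because both IDs come from the same row, no second lookup is needed to pair them.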
This brings ClinVar annotations in line with other MaveDB annotations, which are organized around dedicated, source-specific data structures rather than a generic association table. If additional external database associations are added in the future, they should continue to follow this pattern.
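A minimal sketch of what the dedicated table could look like, using an in-memory SQLite database as a stand-in for MaveDB's actual database and migration tooling. Only the two identifier columns come from this proposal; the surrogate `id` key, the `NOT NULL` constraints, and the uniqueness constraint are assumptions, and any other columns the real table carries are omitted.

```python
import sqlite3

# Hypothetical schema: only the two clinvar_* column names are from the proposal.
SCHEMA = """
CREATE TABLE clinvar_variants (
    id INTEGER PRIMARY KEY,
    clinvar_allele_id INTEGER NOT NULL,     -- allele-level cross-references
    clinvar_variation_id INTEGER NOT NULL,  -- external ClinVar links
    UNIQUE (clinvar_allele_id, clinvar_variation_id)
);
"""


def create_clinvar_variants(conn: sqlite3.Connection) -> None:
    """Create the sketched clinvar_variants table on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_clinvar_variants(conn)

# Made-up IDs, used only to show that both identifiers are stored together.
conn.execute(
    "INSERT INTO clinvar_variants (clinvar_allele_id, clinvar_variation_id) "
    "VALUES (?, ?)",
    (111, 222),
)
row = conn.execute(
    "SELECT clinvar_allele_id, clinvar_variation_id FROM clinvar_variants"
).fetchone()
columns = [info[1] for info in conn.execute("PRAGMA table_info(clinvar_variants)")]
```

Keeping each identifier in its own named column means neither has to be squeezed into a generic association field.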
Acceptance Criteria
clinical_controls replaced with clinvar_variants in the data model