Ensembl patch-scaffold transcripts (HG..._PATCH) missing from data: some ClinVar variants don't resolve #113

Description

@davmlaw

🤖 Written by Claude

Summary

A small number of Ensembl transcripts are missing from the cdot data because Ensembl annotates them only on patch / haplotype scaffolds (HG..._PATCH) rather than the primary assembly. ClinVar/ClinGen still emits c.HGVS referencing these transcript accessions (mapped to primary-chromosome genomic coordinates), so those variants fail to resolve.

How it was found

Running the ClinVar test corpus (tests/test_data/clinvar_hgvs/clinvar_hgvs_ensembl.tsv) through c_to_g against the REST data provider (cdotlib.org, batch prefetch + local GRCh38 FASTA):

  • RefSeq: 558/558 resolved (100%)
  • Ensembl: 816/818 resolved (99.8%); 2 failed with No alignments for <ac> in GRCh38 using splign

Both missing accessions 404 on https://cdotlib.org/transcript/<ac> (including the versionless form), i.e. they are genuinely absent from the deployed Ensembl data.
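The absence check above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the helper names (`candidate_accessions`, `http_status`, `is_absent`) are hypothetical, and the only thing taken from the issue is the `https://cdotlib.org/transcript/<ac>` URL pattern and the 404 criterion.

```python
from typing import Callable, List
from urllib.error import HTTPError
from urllib.request import urlopen

# URL pattern quoted in the issue; everything else here is illustrative.
BASE_URL = "https://cdotlib.org/transcript/"


def candidate_accessions(ac: str) -> List[str]:
    """Return the accession as given, plus its versionless form if it has one."""
    base, dot, version = ac.rpartition(".")
    if dot and version.isdigit():
        return [ac, base]
    return [ac]


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET on url (404 comes back as a code)."""
    try:
        with urlopen(url, timeout=30) as resp:
            return resp.status
    except HTTPError as err:
        return err.code


def is_absent(ac: str, get_status: Callable[[str], int] = http_status) -> bool:
    """True if both the versioned and versionless forms return 404."""
    return all(get_status(BASE_URL + c) == 404 for c in candidate_accessions(ac))
```

Calling `is_absent("ENST00000642844.3")` would perform the live requests; passing a stub as `get_status` keeps the logic testable offline.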

The two transcripts

| Transcript | Gene | Ensembl scaffold (GRCh38) | Example HGVS (from ClinVar) | Expected genomic |
|---|---|---|---|---|
| ENST00000642844.3 | SLC37A4 | HG2217_PATCH | ENST00000642844.3:c.1167G>A | NC_000011.10:g.119025033C>T |
| ENST00000643490.2 | POLR2A | HG2046_PATCH | ENST00000643490.2:c.3371T>C | NC_000017.11:g.7508381T>C |

(Confirmed via the Ensembl REST API lookup/id endpoint: both are protein-coding transcripts whose seq_region_name is an HG..._PATCH scaffold.)
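A rough sketch of that confirmation step, for anyone wanting to repeat it. The function names are hypothetical, and two things are assumptions rather than facts from this issue: that the lookup is made with the unversioned stable ID, and that a simple `HG..._PATCH` name test is enough to recognise a patch scaffold.

```python
import json
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(ac: str) -> str:
    """Build the lookup/id URL, dropping any version suffix (assumption)."""
    stable_id = ac.split(".", 1)[0]
    return f"{ENSEMBL_REST}/lookup/id/{stable_id}"


def fetch_lookup(ac: str) -> dict:
    """Fetch the lookup record as JSON (performs a live request)."""
    req = Request(lookup_url(ac), headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def on_patch_scaffold(record: dict) -> bool:
    """True if the record's seq_region_name looks like HG..._PATCH."""
    name = record.get("seq_region_name", "")
    return name.startswith("HG") and name.endswith("_PATCH")
```

For example, `on_patch_scaffold(fetch_lookup("ENST00000642844.3"))` would be expected to be true if the record still reports the scaffold listed in the table.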

Likely root cause

cdot's Ensembl GFF processing includes transcripts aligned to the primary assembly only; transcripts that Ensembl places solely on patch/haplotype scaffolds are dropped. The genes themselves are on the main chromosomes (SLC37A4 → chr11, POLR2A → chr17), so the variants do have valid primary-assembly positions, but the specific transcript accession isn't in the data to drive the mapping.
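One way to test this hypothesis against the Ensembl annotation files is to list transcripts that never appear on a primary chromosome. The sketch below is a diagnostic only and is not cdot's actual GFF code. It assumes Ensembl-style GFF3 where transcript features carry an `ID=transcript:<stable_id>` attribute and primary chromosomes are named `1`..`22`, `X`, `Y`, `MT`; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Assumed naming of primary-assembly sequences in Ensembl GFF3 files.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def transcript_seqids(gff_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each transcript stable ID to the set of seqids it is annotated on."""
    placements: Dict[str, Set[str]] = defaultdict(set)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        feature_id = attrs.get("ID", "")
        if feature_id.startswith("transcript:"):
            placements[feature_id.split(":", 1)[1]].add(cols[0])
    return placements


def patch_only(placements: Dict[str, Set[str]]) -> List[str]:
    """Transcripts with no placement on any primary chromosome."""
    return sorted(t for t, seqids in placements.items() if not seqids & PRIMARY)
```

Feeding the lines of both the primary and the patch/haplotype GFF3 files through `transcript_seqids` and then `patch_only` would give the set of accessions that a primary-only filter drops.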

Possible actions

  • Decide whether to include patch-scaffold Ensembl transcripts in the data (and how to align them to the primary assembly), or
  • Document this as a known limitation.

Impact is small (2/818 ≈ 0.2% of the Ensembl ClinVar test set), but flagging so it's a conscious decision. Data-coverage issue only; no client-code bug.
