🤖 Written by Claude
Summary
A small number of Ensembl transcripts are missing from the cdot data because Ensembl annotates them only on patch / haplotype scaffolds (HG..._PATCH) rather than the primary assembly. ClinVar/ClinGen still emits c.HGVS referencing these transcript accessions (mapped to primary-chromosome genomic coordinates), so those variants fail to resolve.
How it was found
Running the ClinVar test corpus (tests/test_data/clinvar_hgvs/clinvar_hgvs_ensembl.tsv) through c_to_g against the REST data provider (cdotlib.org, batch prefetch + local GRCh38 FASTA):
- RefSeq: 558/558 resolved (100%)
- Ensembl: 816/818 resolved (99.8%); 2 failed with:
No alignments for <ac> in GRCh38 using splign
Both missing accessions 404 on https://cdotlib.org/transcript/<ac> (including the versionless form), i.e. they are genuinely absent from the deployed Ensembl data.
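A minimal sketch of that 404 check, for anyone who wants to repeat it. Only the URL pattern comes from this report; the helper names (`candidate_accessions`, `http_status`, `is_absent`) are hypothetical, and "absent" is assumed to mean that both the versioned and versionless forms return 404.

```python
import urllib.error
import urllib.request

# URL pattern quoted in this report; everything else here is illustrative.
BASE_URL = "https://cdotlib.org/transcript/"


def candidate_accessions(ac):
    """Return the accession as given, followed by its versionless form."""
    base = ac.split(".", 1)[0]
    return [ac] if base == ac else [ac, base]


def http_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def is_absent(ac, get_status=http_status):
    """True if every form of the accession returns 404.

    get_status is injectable so the logic can be exercised offline.
    """
    return all(get_status(BASE_URL + c) == 404 for c in candidate_accessions(ac))
```

Given the 404s described above, `is_absent("ENST00000642844.3")` and `is_absent("ENST00000643490.2")` would be expected to return `True` against the current deployment.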
The two transcripts
| Transcript | Gene | Ensembl scaffold (GRCh38) | Example HGVS (from ClinVar) | Expected genomic |
| --- | --- | --- | --- | --- |
| ENST00000642844.3 | SLC37A4 | HG2217_PATCH | ENST00000642844.3:c.1167G>A | NC_000011.10:g.119025033C>T |
| ENST00000643490.2 | POLR2A | HG2046_PATCH | ENST00000643490.2:c.3371T>C | NC_000017.11:g.7508381T>C |
(Confirmed via the Ensembl REST API lookup/id: both are protein-coding transcripts whose seq_region_name is an HG..._PATCH scaffold.)
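The lookup can be reproduced roughly as follows. This is a sketch, not the exact script used: the function names are hypothetical, the version suffix is stripped on the assumption that lookup/id is most reliable with unversioned stable IDs, and the `_PATCH` suffix test is a naming heuristic based on the scaffold names in the table rather than an official Ensembl flag.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id):
    """Build the lookup/id URL for a stable ID, dropping any version suffix."""
    unversioned = stable_id.split(".", 1)[0]
    return f"{ENSEMBL_REST}/lookup/id/{unversioned}?content-type=application/json"


def fetch_lookup(stable_id):
    """Fetch the lookup/id record as a dict (requires network access)."""
    with urllib.request.urlopen(lookup_url(stable_id), timeout=10) as resp:
        return json.load(resp)


def is_patch_scaffold(record):
    """Heuristic (assumption): a seq_region_name ending in _PATCH is a patch scaffold."""
    return str(record.get("seq_region_name", "")).upper().endswith("_PATCH")
```

For example, `is_patch_scaffold(fetch_lookup("ENST00000642844.3"))` checks the first transcript in the table; the suffix match is used so that a prefixed scaffold name would still be recognised.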
Likely root cause
cdot's Ensembl GFF processing includes transcripts aligned to the primary assembly only; transcripts that Ensembl places solely on patch/haplotype scaffolds are dropped. The genes themselves are on the main chromosomes (SLC37A4 → chr11, POLR2A → chr17), so the variants do have valid primary-assembly positions, but the specific transcript accession isn't in the data to drive the mapping.
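To make the suspected mechanism concrete, here is a hypothetical illustration of a primary-assembly filter over GFF3 records. It is not cdot's actual code: the `PRIMARY_SEQIDS` set, the reliance on a `transcript_id` attribute, and the function name are all assumptions made for the sketch.

```python
# Assumed set of primary-assembly sequence names (Ensembl-style, no "chr" prefix).
PRIMARY_SEQIDS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def split_by_placement(gff_lines):
    """Split transcript IDs into (kept, dropped) by where they are annotated.

    kept:    IDs seen on at least one primary-assembly sequence.
    dropped: IDs seen only on other sequences, such as patch scaffolds.
    """
    kept, elsewhere = set(), set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, attrs = cols[0], cols[8]
        fields = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        transcript_id = fields.get("transcript_id")
        if transcript_id is None:
            continue
        (kept if seqid in PRIMARY_SEQIDS else elsewhere).add(transcript_id)
    return kept, elsewhere - kept
```

Under this kind of filter, a transcript annotated only on a patch scaffold ends up in `dropped` even though its gene sits on a primary chromosome, which matches the behaviour described above.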
Possible actions
- Decide whether to include patch-scaffold Ensembl transcripts in the data (and how to align them to the primary assembly), or
- Document this as a known limitation.
Impact is small (2/818 ≈ 0.2% of the Ensembl ClinVar test set), but flagging so it's a conscious decision. Data-coverage issue only; no client-code bug.