Investigators can upload variants to MaveDB that are defined at the nucleotide or protein level relative to a target. When mapping to the human genome, nucleotide variants are mapped to their chromosomal location (g. variants) and protein variants are mapped to protein references. For nucleotide variants that were mapped to a chromosomal location, we also generate a c. variant from the g. variant, relative to the MANE Select transcript that the target mapped to.
Some MAVEs are performed at the amino acid level, meaning that a score is provided for each possible protein change over the target. Some investigators define these variants at the nucleotide level, and since each codon has been changed to code for each amino acid, some of the resulting variants are delins or multi-variants. The official HGVS rule is that multiple base changes separated by an unchanged base should be represented as a delins only if the base changes are part of the same codon. Since g. variants aren't relative to a transcript, can (and should) we still apply this rule to the g. variants? Alan has mentioned that, conceptually, HGVS strings should represent an "event." Since these variants represent a codon being replaced with a different codon to measure the function of a protein variant, would that be an argument for using delins notation at the g. level?
If we don't use delins notation for g. multi-base variants with an unchanged base in between, but we do use delins notation for the same variant at the c. level, we end up with two inconsistencies. First, the g. and c. variants disagree: the g. variant will be a multi-variant and the c. variant will be a delins. Second, variants in the same score set disagree: a codon change with two or three adjacent changed bases will be a delins, while a similar codon change in the same score set with an unchanged second base will be a multi-variant with a semicolon.
The same question applies to representing these variants with VRS on the genomic/chromosomal level. I believe that the Python VRS generator turns multi-variants separated by semicolons into CisPhasedBlocks. Should a codon change with two changed bases separated by an unchanged base in the middle be represented as a CisPhasedBlock or a single Allele with a delins, when relative to a chromosome rather than a transcript?
Here is an example of a variant that results in this issue:
NM_000314.8:c.142_144delinsGAT
Which on the genomic level could be represented as either:
NC_000010.11:g.[87894087A>G;87894089C>T]
NC_000010.11:g.87894087_87894089delinsGAT
ClinGen provides g. HGVS for this variant using the delins notation (see http://reg.genome.network/allele/CA891835316).
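To make the two candidate g. representations concrete, here is a minimal Python sketch that converts between them for the example above. The function names (`substitutions_to_delins`, `delins_to_multivariant`) are hypothetical helpers written for this illustration and are not part of MaveDB or any HGVS library. The unchanged middle base is taken to be `A`, which is inferred from the delins `GAT` leaving position 87894088 unchanged, and the sketch assumes a plus-strand target so no reverse complement is needed.

```python
def substitutions_to_delins(accession, subs, unchanged):
    """Merge nearby substitutions into one g. delins string.

    subs: list of (position, ref_base, alt_base) tuples.
    unchanged: {position: base} for reference bases lying between the
        substitutions (assumed to be supplied by the caller).
    """
    alt_by_pos = {pos: alt for pos, _ref, alt in subs}
    start, end = min(alt_by_pos), max(alt_by_pos)
    # Walk the full span, taking the new base where one exists and the
    # unchanged reference base everywhere else.
    inserted = "".join(
        alt_by_pos[p] if p in alt_by_pos else unchanged[p]
        for p in range(start, end + 1)
    )
    return f"{accession}:g.{start}_{end}delins{inserted}"


def delins_to_multivariant(accession, start, ref_seq, alt_seq):
    """Split an equal-length delins into per-base substitutions."""
    if len(ref_seq) != len(alt_seq):
        raise ValueError("only equal-length delins can be split base by base")
    changes = [
        f"{start + i}{r}>{a}"
        for i, (r, a) in enumerate(zip(ref_seq, alt_seq))
        if r != a
    ]
    if not changes:
        raise ValueError("reference and alternate sequences are identical")
    if len(changes) == 1:
        return f"{accession}:g.{changes[0]}"
    return f"{accession}:g.[{';'.join(changes)}]"


ACC = "NC_000010.11"
delins = substitutions_to_delins(
    ACC,
    [(87894087, "A", "G"), (87894089, "C", "T")],
    {87894088: "A"},  # inferred unchanged middle base
)
multi = delins_to_multivariant(ACC, 87894087, "AAC", "GAT")
print(delins)
print(multi)
```

Both forms describe the same sequence change, so the conversion itself is mechanical. The open question is only which form we should emit at the g. level.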