[Feature Request] Additional info during ClinVar parsing #83

Open
olingerc opened this issue Jun 28, 2022 · 4 comments

@olingerc

Dear Nirvana team,

I'm sorry to misuse the issue tracker for a feature request. I was not sure how best to approach you.

Thanks for the detailed information on how you compile the ClinVar entries (HERE). We often have many ClinVar entries at a single position. Even after reducing to isAlleleSpecific, it is sometimes difficult to tell which entries are relevant to our variant.
This is especially true for ClinVar entries that relate to variants at multiple sites, meaning they only make sense when several variants are present together (a haplotype). This information is stored in the Measure and GenotypeSet fields. Would it be possible to at least include Measure? The example below from your documentation displays "single nucleotide variant", but we would like to identify cases where this value would be "Haplotype" or "Genotype". That way we could remove VCVs that only make sense when all variants are present.

<GenotypeSet Type="CompoundHeterozygote" ID="424709">
  <MeasureSet Type="Variant" ID="81">
    <Measure Type="single nucleotide variant" ID="15120">
      <SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
        AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
        stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
        positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
      <SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
        AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
        stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
        positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
    </Measure>
  </MeasureSet>
</GenotypeSet>
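
To make the request concrete, here is a minimal Python sketch (not Nirvana code) that walks a fragment like the one above and reports the `Type` attribute at each level. The function name `collect_types` and the trimmed `FRAGMENT` string are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above (SequenceLocation elements omitted).
FRAGMENT = """
<GenotypeSet Type="CompoundHeterozygote" ID="424709">
  <MeasureSet Type="Variant" ID="81">
    <Measure Type="single nucleotide variant" ID="15120"/>
  </MeasureSet>
</GenotypeSet>
"""


def collect_types(xml_text):
    """Return (tag, Type) pairs for GenotypeSet, MeasureSet and Measure elements."""
    root = ET.fromstring(xml_text)
    wanted = {"GenotypeSet", "MeasureSet", "Measure"}
    # root.iter() yields the root itself first, then descendants in document order.
    return [(el.tag, el.get("Type")) for el in root.iter() if el.tag in wanted]


for tag, type_value in collect_types(FRAGMENT):
    print(f"{tag}: {type_value}")
```

Reporting all three levels, rather than only the innermost Measure, would let a downstream filter see that a "single nucleotide variant" sits inside a compound record.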
@MichaelStromberg
Collaborator

Thanks Christophe! I brought this up with the team during this morning's stand-up meeting. We'll investigate how this is represented in the XML file so that we can provide useful haplotype information.

@rajatshuvro
Collaborator

Hi @olingerc ,
Can you point me to a ClinVar record (RCV) that says 'Haplotype' or 'Genotype' in Measure?

It would be really helpful if you could describe a set of RCVs that are connected via this mechanism and explain in more detail which fields indicate the relationship and how. In short, I am asking for a description of your use case with real examples so that we can better understand the feature you are requesting.

Thanks.

@olingerc
Author

olingerc commented Jun 29, 2022

Hi @rajatshuvro,

An example variant would be: 1-171076966-G-A

Nirvana gives me the following ClinVar list (v3.18.1):
[image: screenshot of the ClinVar entries reported by Nirvana]

There are a total of 3 different (alleleSpecific) VCVs:

  • VCV000038394 classified as benign
  • VCV000016318 classified as pathogenic
  • VCV000217371 classified as pathogenic

However, when opening the ClinVar pages of the two pathogenic variants (here and here), it is obvious that they are only pathogenic when coupled with another variant (Haplotype).

It would be very helpful if we had the "Haplotype" info. It is stored in the Type attribute of the MeasureSet element.

<MeasureSet Type="Haplotype" ID="217371" Acc="VCV000217371" Version="1">
</MeasureSet>

(extracted from the full XML). If I read your code correctly, you almost read this info already here

Here are all possible values:

    <xs:simpleType name="Measuresettypelist">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Gene"/>
            <xs:enumeration value="Variant"/>
            <xs:enumeration value="Haplotype"/>
            <xs:enumeration value="Phase unknown"/>
            <xs:enumeration value="Distinct chromosomes"/>
        </xs:restriction>
    </xs:simpleType>
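
As a rough illustration of how this attribute could be used for filtering, the Python sketch below reads the `Type` of a MeasureSet element and flags records that describe more than one variant. The helper name `measure_set_info` and the grouping in `MULTI_VARIANT_TYPES` are assumptions on my side; the schema only lists the values, it does not say which of them imply multiple variants.

```python
import xml.etree.ElementTree as ET

# Assumption: these MeasureSet types describe a combination of variants.
# The grouping is illustrative and is not taken from the ClinVar schema.
MULTI_VARIANT_TYPES = {"Haplotype", "Phase unknown", "Distinct chromosomes"}

SNIPPET = """
<MeasureSet Type="Haplotype" ID="217371" Acc="VCV000217371" Version="1">
</MeasureSet>
"""


def measure_set_info(xml_text):
    """Return accession, type and a multi-variant flag for the first MeasureSet."""
    root = ET.fromstring(xml_text)
    # The MeasureSet may be the root itself or nested somewhere below it.
    ms = root if root.tag == "MeasureSet" else root.find(".//MeasureSet")
    if ms is None:
        return None
    ms_type = ms.get("Type")
    return {
        "acc": ms.get("Acc"),
        "type": ms_type,
        "multi_variant": ms_type in MULTI_VARIANT_TYPES,
    }


print(measure_set_info(SNIPPET))
```

With a flag like this in the annotation output, VCVs such as VCV000217371 could be dropped when only one of the haplotype's variants is present in the sample.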

A bonus would be knowing which other variant is in the haplotype. A quick fix would be extracting the title:

<ClinVarResult-Set>
   <ClinVarSet ID="101183654">
      <RecordStatus>current</RecordStatus>
      <Title>
         NM_006894.4(FMO3):c.[472G>A;560T>C] AND Trimethylaminuria
      </Title>
      <ReferenceClinVarAssertion ID="477812" DateLastUpdated="2022-06-24" DateCreated="2015-10-30">
...

Within the brackets, we see the identification of the second variant. Having the full list of variants would of course be nice as well, but I guess this would mean more changes to your code.
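
The title-based quick fix could look roughly like the following Python sketch. It assumes the title always carries the allele list in square brackets separated by semicolons, as in the example above; the function name `variants_in_title` is hypothetical, and titles in other formats would simply yield an empty list.

```python
import re

# Matches a bracketed allele list such as c.[472G>A;560T>C].
BRACKETED = re.compile(r"\[([^\]]+)\]")


def variants_in_title(title):
    """Return the individual changes listed inside the first [...] of a title."""
    match = BRACKETED.search(title)
    if match is None:
        return []  # no bracketed allele list, e.g. a single-variant record
    return [part.strip() for part in match.group(1).split(";") if part.strip()]


title = "NM_006894.4(FMO3):c.[472G>A;560T>C] AND Trimethylaminuria"
print(variants_in_title(title))  # ['472G>A', '560T>C']
```

This only recovers the cDNA-level changes from the title text. Mapping them to genomic coordinates would still require the full variant list from the XML.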

Thanks for considering the request!

Here is the corresponding line from a vcf file:

chr1	171076966	.	G	A	128.49	PASS	AC=2;AF=0.333;AN=6;DP=116;FS=4.083;MQ=250;MQRankSum=6.805;QD=1.4;ReadPosRankSum=3.267;SOR=0.346	GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DN	0/0:28,0:0:24:63:PASS:.:.:0,63,945:.:0,74,260:.	0/1:13,16:0.552:29:48:PASS:7,8:6,8:85,0,49:50,6.9375e-05,52.227:128,0,54:.	0/1:33,30:0.476:63:48:PASS:14,12:19,18:84,0,50:49.643,6.8857e-05,53:84,0,124:Inherited
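
For reference, the variant identifier used earlier in this comment (1-171076966-G-A) can be derived from such a VCF line with a small Python sketch like the one below. The function name `variant_keys` is hypothetical, and the example line is truncated to the first seven tab-separated columns.

```python
def variant_keys(vcf_line):
    """Build chrom-pos-ref-alt keys (one per ALT allele) from a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Drop a leading "chr" so the key matches the 1-171076966-G-A style.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # ALT may hold several comma-separated alleles.
    return [f"{chrom}-{pos}-{ref}-{allele}" for allele in alt.split(",")]


line = "chr1\t171076966\t.\tG\tA\t128.49\tPASS"
print(variant_keys(line))  # ['1-171076966-G-A']
```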

@rajatshuvro
Collaborator

Thanks @olingerc . We are actively considering this as an upcoming feature.
