fix: vcf updates #30

Open: wants to merge 2 commits into base: master
@@ -38,7 +38,7 @@ Somatic VCF files available on St. Jude Cloud have been generated when possible
* For patient samples where a complete set of variants have been manually curated and reported as part of an existing publication, the published set of variants was considered the complete set of variants for that patient tumor sample, and used to populate the associated somatic VCF file. This includes samples belonging to the Clinical Pilot cohort study and some of the PCGP cohort studies.
* For currently unpublished patient tumor samples, we filtered variants based on metrics derived from the in-house post-processing pipeline and the input of analysts [(see below)](#sample-variant-validation-and-filtering). This includes the patient samples in the G4K study, along with some unpublished PCGP and Clinical Pilot cohort study samples.
9. Variants were converted from our in-house formats to VCF format.
10. Variant coordinates were lifted over to GRCh38_no_alt using [Picard][picard] `LiftoverVcf`.
10. Variant coordinates were lifted over to GRCh38\_no\_alt using [Picard][picard] `LiftoverVcf`.
11. Variants were normalized using [`vt normalize`][vt].
12. Variants were annotated using [VEP v100][vep] and the `--everything` flag.
13. VCFs were compressed with bgzip and tabix indexed, then validated using [VCFtools' `vcf-validator`][vcf-validator].
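
Steps 10 through 13 above could be sketched as the following command sequence. This is a hedged illustration, not the pipeline's actual invocation: the file names, the `picard.jar` path, and the helper name `liftover_commands` are hypothetical, the Picard arguments use the legacy `KEY=value` style of the 2.x releases, and a real VEP run would also need cache or database options that are omitted here.

```python
from typing import List


def liftover_commands(
    sample: str,
    chain: str = "hg19ToHg38.over.chain",
    reference: str = "GRCh38_no_alt.fa",
) -> List[List[str]]:
    """Build (but do not run) the commands for steps 10-13.

    All file names are hypothetical placeholders derived from `sample`.
    """
    hg19 = f"{sample}.hg19.vcf"
    lifted = f"{sample}.hg38.vcf"
    rejected = f"{sample}.rejected.vcf"
    normalized = f"{sample}.hg38.norm.vcf"
    annotated = f"{sample}.hg38.norm.vep.vcf"
    return [
        # Step 10: lift hg19 coordinates over to GRCh38_no_alt.
        ["java", "-jar", "picard.jar", "LiftoverVcf",
         f"I={hg19}", f"O={lifted}", f"CHAIN={chain}",
         f"REJECT={rejected}", f"R={reference}"],
        # Step 11: normalize variant representation.
        ["vt", "normalize", lifted, "-r", reference, "-o", normalized],
        # Step 12: annotate with VEP, writing VCF output.
        ["vep", "-i", normalized, "-o", annotated, "--vcf", "--everything"],
        # Step 13: compress, index, and validate.
        ["bgzip", annotated],
        ["tabix", "-p", "vcf", f"{annotated}.gz"],
        ["vcf-validator", f"{annotated}.gz"],
    ]


if __name__ == "__main__":
    for command in liftover_commands("SAMPLE"):
        print(" ".join(command))
```

Returning argument lists rather than shell strings keeps the sketch inspectable without the tools installed; each list could be passed to `subprocess.run(command, check=True)` in an environment where they are available.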
@@ -56,7 +56,7 @@ The validation_status field indicates the status of validating the presence and

For validation by a secondary method, "valid" indicates the variant was validated by the secondary method, "putative" indicates that the secondary method could neither confirm nor invalidate the variant (e.g. due to lack of read coverage by the secondary method), and "invalid" indicates a negative result from the secondary method.

Cross-validation of variant data across multiple sequencing methodologies (WGS, WES, RNA-Seq etc) has been used across clinical genomics studies (Clinical Pilot, G4K, and RTCG). Here SNVs and indels are first manually reviewed and called as "Good" or "Bad" for each sequencing methodology. To make this assessment, the analyst considers 1) computationally generated quality metrics and tags assigned to the variant based on read counts, base qualities, and realignment; and 2) manual inspection of tumor and germline read evidence within sequencing data from the given sequencing platform. The assigned validation status is determined based on the observed support for the variant across each of the sequencing methodologies as follows:
Cross-validation of variant data across multiple sequencing methodologies (WGS, WES, RNA-Seq, etc.) has been used across clinical genomics studies (Clinical Pilot, G4K, and RTCG). Here SNVs and indels are first manually reviewed and called as "Good" or "Bad" for each sequencing methodology. To make this assessment, the analyst considers 1) computationally generated quality metrics and tags assigned to the variant based on read counts, base qualities, and realignment; and 2) manual inspection of tumor and germline read evidence within sequencing data from the given sequencing platform. The assigned validation status is determined based on the observed support for the variant across each of the sequencing methodologies as follows:

##### Fresh/Frozen Samples

@@ -86,7 +86,9 @@ Where possible, allele counts and read depths for each variant are determined ac

#### Processing tools

The newly created hg19 coordinate VCFs are lifted over using [Picard `LiftoverVcf`][picard]. [`hg19ToHg38.over.chain`][chain] is used as the chain file and `GRCh38_no_alt.fa` is used as the reference. The hg38 build includes some reference contigs not present in the GRCh38_no_alt build, so variants which would map to one of those alternate sequences are excluded from the VCFs. Variants which fail Picard liftover may be reviewed by an analyst and lifted over manually. [`bcftools annotate`][bcftools] is used to document the reference file and exact liftover command used in the VCF's header. Note that hg19 alleles and coordinates are stored by Picard in the `OriginalAlleles`, `OriginalContig`, and `OriginalStart` info tags within the final VCF file. Please also note that there is a bug in the version of Picard we use (version 2.18.29) which rarely truncates some of the VCF's format and sample fields. These are fields which we use to store read counts and depths for each sequencing type (WGS, WES, RNA-Seq, VALCAP). A [bug report][bug_report] has been filed with Picard. In the current version of the pipeline we recover the dropped entries from the original hg19 VCF. While searching for and correcting these entries, we also reorder the FORMAT and genotype fields to a logical ordering, as opposed to the alphabetical output order of Picard.
The newly created hg19 coordinate VCFs are lifted over using [Picard `LiftoverVcf`][picard]. [`hg19ToHg38.over.chain`][chain] is used as the chain file and `GRCh38_no_alt.fa` is used as the reference. The hg38 build includes some reference contigs not present in the GRCh38\_no\_alt build, so variants which would map to one of those alternate sequences are excluded from the VCFs. There are also differences in individual segments of sequences between the builds which can cause a variant to fail liftover. An investigation was performed on variants which failed liftover in the Clinical Pilot cohort. It was found all failures were due to artifacts of the hg19 genome build, and no variants of interest were lost. As such no effort is currently made to review or "rescue" variants which fail liftover. Note that hg19 alleles and coordinates are stored by Picard in the `OriginalAlleles`, `OriginalContig`, and `OriginalStart` info tags within the final VCF file. Please also note that there is a bug in the version of Picard we use (version 2.18.29) which rarely truncates some of the VCF's format and sample fields. These are fields which we use to store read counts and depths for each sequencing type (WGS, WES, RNA-Seq, VALCAP). A [bug report][bug_report] has been filed with Picard. In the current version of the pipeline we recover the dropped entries from the original hg19 VCF. While searching for and correcting these entries, we also reorder the FORMAT and genotype fields to a logical ordering, as opposed to the alphabetical output order of Picard.

[`bcftools annotate`][bcftools] is used to document the reference file and exact liftover command used in the VCF's header.

`vt normalize` is used to obtain consistent representations of all variants which may have multiple equivalent forms. If VT modifies a variant, it records the original form in the `OLD_VARIANT` info tag. [Click here][vt] to read about variant normalization.
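
As a rough illustration of how a consumer of these VCFs might read the provenance tags described above, the sketch below pulls `OriginalContig`, `OriginalStart`, `OriginalAlleles`, and `OLD_VARIANT` out of an INFO column. The helper names and the example INFO string are hypothetical, and the value layouts (comma-separated alleles, a `contig:pos:ref/alt` string for `OLD_VARIANT`) are assumptions that should be checked against the headers of the actual files.

```python
from typing import Dict, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value mapping.

    Flag-style entries without '=' are stored with an empty string value.
    """
    fields: Dict[str, str] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else ""
    return fields


def original_hg19_site(info: str) -> Optional[Dict[str, object]]:
    """Return the pre-liftover hg19 site recorded by Picard, if present."""
    fields = parse_info(info)
    if "OriginalContig" not in fields or "OriginalStart" not in fields:
        return None
    site: Dict[str, object] = {
        "contig": fields["OriginalContig"],
        "start": int(fields["OriginalStart"]),
    }
    # Assumed to be a comma-separated allele list when present.
    if "OriginalAlleles" in fields:
        site["alleles"] = fields["OriginalAlleles"].split(",")
    return site


# Made-up INFO value for illustration only; not taken from a real file.
example_info = "OriginalContig=chr1;OriginalStart=12345;OLD_VARIANT=chr1:12000:CAA/CA"

if __name__ == "__main__":
    print(original_hg19_site(example_info))
    print(parse_info(example_info).get("OLD_VARIANT"))
```

For production use, a VCF library that parses INFO fields according to the header definitions would likely be a safer choice than manual string splitting.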
