-
Notifications
You must be signed in to change notification settings - Fork 15
Variant comparison
To compare two or more set of variant calls, describe the sets of variants to compare in an input YAML configuration file. The config directory contains several example configurations to help get started. Here is an example config that compares two sets of calls (GATK and samtools) for an NA12878 sample, first performing normalization and conversion to GRCh37 coordinates where the GATK input was called against hg19 and samtools was called against a sorted GRCh37 genome:
dir:
base: /base/directory
out: work
prep: work/prep
experiments:
- sample: NA12878
ref: /path/to/GRCh37.fa
calls:
- name: gatk
file: gatk.vcf
ref: /path/to/hg19.fa
prep: true
- name: samtools
file: samtools.vcf
Run with:
lein variant-compare /path/to/your/NA12878-compare.yaml
Which produces output files in /base/directory/work
. The files of interest include:
-
NA12878-compare-summary.txt
: Summary table of concordance and discordance comparisons. -
NA12878-gatk-samtools-concordance.vcf
: Concordant variants from both GATK and samtools. -
NA12878-gatk-samtools-discordance.vcf
: Discordant variants called in GATK and not called, or differently called, in samtools. -
NA12878-samtools-gatk-discordance.vcf
: Discordant variants called in samtools and not called, or differently called, in GATK.
The summary file contains a table of concordance and discordance stats for each pairwise comparison:
|--------------------------------+--------|
| Metric | Value |
|--------------------------------+--------|
| Overall genotype concordance | 99.70 |
| Callable genotype concordance | 99.70 |
| Non-reference discrepancy rate | 0.30 |
| Non-reference sensitivity | 99.88 |
| concordant: total | 186533 |
| concordant: non-reference | 186533 |
| concordant: SNPs | 171484 |
| concordant: indels | 15049 |
| illumina-discordant: total | 18710 |
| illumina-discordant: SNPs | 10560 |
| illumina-discordant: indels | 8150 |
| cg-discordant: total | 9054 |
| cg-discordant: unique | 42 |
| cg-discordant: SNPs | 6358 |
| cg-discordant: indels | 2667 |
| Shared discordant | 1041 |
| het/hom discordant | 749 |
|--------------------------------+--------|
-
The first 4 metrics: concordance, callable concordance non-reference discrepancy rate and non-reference sensitivity provide measures of how well the variants match. GATK's VariantEval documentation provides a great description of each. Callable concordance evaluates only those regions callable, according to GATK's CallableLoci walker.
-
The next 4 lines provide counts for concordant variants, broken down into SNPs and indels.
-
The next 7 lines describe discordant variant counts. We split these into discordant variants called in each method in the comparison (illumina and cg, in this example). We further subdivide them by SNPs and indels. If the configuration contains BAM files, we also calculate the unique discordants found only in that method and not callable in the other method.
-
The final two lines identify variants shared between the two methods. These are variants called in both, but that are different either because of the variant called (different indels, for example) or because of the zygosity. The final line identifies the subset of these that are due to heterozygote/homozygote variant differences.
For example, a shared SNP difference could happen if the reference materials had a homozygote call and the reference had a heterozygote call:
chr1 1234 . A T . ref-call GT 1/1 chr1 1234 . A T . eval-call GT 0/1
and a shared indel with different overlapping regions:
chr1 1234 . A ATGC . ref-call GT 0/1 chr1 1234 . A ATGCGC . eval-call GT 0/1