Skip to content
chapmanb edited this page Nov 13, 2014 · 7 revisions

To compare two or more set of variant calls, describe the sets of variants to compare in an input YAML configuration file. The config directory contains several example configurations to help get started. Here is an example config that compares two sets of calls (GATK and samtools) for an NA12878 sample, first performing normalization and conversion to GRCh37 coordinates where the GATK input was called against hg19 and samtools was called against a sorted GRCh37 genome:

dir:
  base: /base/directory
  out: work
  prep: work/prep
experiments:
  - sample: NA12878
    ref: /path/to/GRCh37.fa
    calls:
      - name: gatk
        file: gatk.vcf
        ref: /path/to/hg19.fa
        prep: true
      - name: samtools
        file: samtools.vcf

Run with:

lein variant-compare /path/to/your/NA12878-compare.yaml

Which produces output files in /base/directory/work. The files of interest include:

  • NA12878-compare-summary.txt: Summary table of concordance and discordance comparisons.
  • NA12878-gatk-samtools-concordance.vcf: Concordant variants from both GATK and samtools.
  • NA12878-gatk-samtools-discordance.vcf: Discordant variants called in GATK and not called, or differently called, in samtools.
  • NA12878-samtools-gatk-discordance.vcf: Discordant variants called in samtools and not called, or differently called, in GATK.

Metrics table

The summary file contains a table of concordance and discordance stats for each pairwise comparison:

|--------------------------------+--------|
|             Metric             |  Value |
|--------------------------------+--------|
| Overall genotype concordance   | 99.70  |
| Callable genotype concordance  | 99.70  |
| Non-reference discrepancy rate | 0.30   |
| Non-reference sensitivity      | 99.88  |
| concordant: total              | 186533 |
| concordant: non-reference      | 186533 |
| concordant: SNPs               | 171484 |
| concordant: indels             | 15049  |
| illumina-discordant: total     | 18710  |
| illumina-discordant: SNPs      | 10560  |
| illumina-discordant: indels    | 8150   |
| cg-discordant: total           | 9054   |
| cg-discordant: unique          | 42     |
| cg-discordant: SNPs            | 6358   |
| cg-discordant: indels          | 2667   |
| Shared discordant              | 1041   |
| het/hom discordant             | 749    |
|--------------------------------+--------|
  • The first 4 metrics: concordance, callable concordance non-reference discrepancy rate and non-reference sensitivity provide measures of how well the variants match. GATK's VariantEval documentation provides a great description of each. Callable concordance evaluates only those regions callable, according to GATK's CallableLoci walker.

  • The next 4 lines provide counts for concordant variants, broken down into SNPs and indels.

  • The next 7 lines describe discordant variant counts. We split these into discordant variants called in each method in the comparison (illumina and cg, in this example). We further subdivide them by SNPs and indels. If the configuration contains BAM files, we also calculate the unique discordants found only in that method and not callable in the other method.

  • The final two lines identify variants shared between the two methods. These are variants called in both, but that are different either because of the variant called (different indels, for example) or because of the zygosity. The final line identifies the subset of these that are due to heterozygote/homozygote variant differences.

    For example, a shared SNP difference could happen if the reference materials had a homozygote call and the reference had a heterozygote call:

    chr1  1234 .  A   T  . ref-call  GT  1/1
    chr1  1234 .  A   T  . eval-call  GT 0/1
    

    and a shared indel with different overlapping regions:

    chr1  1234 .  A   ATGC  .   ref-call  GT  0/1
    chr1  1234 .  A   ATGCGC  . eval-call  GT 0/1
    
Clone this wiki locally