Update 00_intro_to_variant_calling.md
Gammerdinger authored Dec 13, 2023
1 parent 18b5c7e commit 8ff4431
Showing 1 changed file with 35 additions and 9 deletions.
44 changes: 35 additions & 9 deletions lessons/00_intro_to_variant_calling.md
@@ -12,11 +12,13 @@ Approximate time: 45 minutes
- Differentiate between somatic and germline variants
- Describe the key steps in the variant calling pipeline

Genomic variants are the basis for many diseases and are the raw material for evolutionary processes. As such, analyzing genomic variants is of great interest to clinical physicians, evolutionary biologists and many more. This workshop aims to guide participants through the process of:
## Motivations and Goals

1. Calling variants
2. Annotating variants
3. Prioritizing variants
Genomic variants are the basis for many diseases and are the raw material for evolutionary processes. Analyzing genomic variants can help inform clinicians and also help discover novel variants responsible for disease. Evolutionary biologists interpret genomic variants to quantify the evolutionary relationships between species. Conservation biologists use genetic variants to measure diversity among endangered populations. As such, there is broad interest in analyzing genomic variants among clinicians and biologists alike. Analyzing variants takes the form of three main steps:

1. **Calling variants** - Identifying loci that differ from the reference genome
2. **Annotating variants** - Categorizing the variants in the context of known gene models
3. **Prioritizing variants** - Interpreting the annotated variants based upon the impact of the variant and the gene(s) it resides within

However, before we can get started, we first must talk about the different types of variants that exist and the challenges that researchers face when analyzing them.

@@ -42,12 +44,12 @@ As you can also presume, high coverage is helpful for two reasons:

There are several different types of variants, each requiring its own consideration. These include:

- Single Nucleotide Polymorphisms (SNPs)
- Small Insertions/Deletions (Indels)
- Copy Number Variants (CNVs)
- Structural Variants (SVs)
- **Single Nucleotide Polymorphisms (SNPs)** - These are positions in the genome where a single base has been mutated. For example, perhaps most individuals in a population have a thymine at a given position, while an individual of interest has an adenine at this position.
- **Small Insertions/Deletions (Indels)** - Small indels are loci where a few bases have been added or removed relative to the larger population. For example, if an individual has an extra `GA` at a location relative to the rest of the population, then this would be considered a small insertion.
- **Structural Variants (SVs)** - This class refers to a broad collection of variants, including inversions, translocations and large insertions or deletions.
- **Copy Number Variants (CNVs)** - These types of variants often occur in repetitive regions of the genome and involve having more or fewer copies of a given sequence. For example, the *AMY1* gene, which encodes an enzyme that is important in breaking down starches, has been shown to have variable numbers of copies across human populations. Further work has shown that the number of copies correlates with the levels of starch in the diets of various cultures (Perry et al., 2007). Depending on the size, copy number variants are sometimes considered a subcategory of structural variants.
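
To make the distinction between these categories concrete, here is a minimal Python sketch that labels a variant from its reference and alternate alleles. It assumes VCF-style alleles (an indel includes one anchoring base), and the 50 bp cutoff between "small" and "large" indels is a commonly used convention rather than something defined in this lesson. The function name `classify_variant` is hypothetical, and copy number variants cannot be detected from allele strings alone, so they are not handled here.

```python
def classify_variant(ref: str, alt: str, max_small_indel: int = 50) -> str:
    """Label a variant from its reference and alternate alleles.

    Assumes VCF-style alleles. The 50 bp small/large cutoff is an
    illustrative assumption, not a value taken from the lesson.
    """
    if len(ref) == 1 and len(alt) == 1:
        # Same length, single base: a SNP if the bases differ
        return "SNP" if ref != alt else "no variant"

    size = abs(len(alt) - len(ref))
    if size == 0:
        # Same length but more than one base (e.g. a multi-base substitution)
        return "other"

    kind = "insertion" if len(alt) > len(ref) else "deletion"
    if size > max_small_indel:
        return f"large {kind} (structural variant)"
    return f"small {kind}"


# A thymine replaced by an adenine
print(classify_variant("T", "A"))
# An extra GA inserted after an anchoring C
print(classify_variant("C", "CGA"))
```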

Similarly to the tools in a workshop, variant calling for each of these types of variants requires tools created for it. [`GATK`](https://gatk.broadinstitute.org/hc/en-us) has packages that can address the needs of several of these:
Much like the tools in a workshop, each of these variant types requires tools designed for it. [`GATK`](https://gatk.broadinstitute.org/hc/en-us) is a popular toolkit for variant calling that was developed and is maintained by the Broad Institute. `GATK` has packages that can address the needs of several types of variant calling:

- [`HaplotypeCaller`](https://gatk.broadinstitute.org/hc/en-us/articles/5358864757787-HaplotypeCaller) can be used for germline SNPs and Indels
- [`Mutect2`](https://gatk.broadinstitute.org/hc/en-us/articles/5358911630107-Mutect2) can be used for somatic SNPs and Indels
@@ -56,4 +58,28 @@ Similarly to the tools in a workshop, variant calling for each of these types of

This course is going to focus on analyzing somatic SNPs, so we are going to use `Mutect2`.

## The Importance of Coverage

One of the most important considerations of experimental design when carrying out a study to identify variants is to sequence your samples to an adequate level of coverage. Coverage is the number of sequencing reads that span (or "cover") a given position. A sample's coverage is reported as the average across positions, written as an integer followed by "X". For example, if the average position in the genome was covered by 22 reads, this sample would be considered to have 22X coverage. Generally speaking, researchers are encouraged to reach a minimum coverage of 30X for variant calling. However, higher coverage can be useful for detecting rare variants, particularly in somatic variant calling.
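
As a rough illustration of how coverage is computed, the Python sketch below averages per-position read depths and estimates how many reads are needed to reach a target coverage. The genome size (about 3.1 billion bases) and read length (150 bases) are illustrative assumptions, not values from this lesson, and the function names are hypothetical. The read estimate also assumes every read maps and that reads are spread evenly, which real data will not satisfy.

```python
def mean_coverage(depths: list[int]) -> float:
    """Average number of reads covering each position."""
    return sum(depths) / len(depths)


def reads_needed(target_coverage: int, genome_size: int, read_length: int) -> int:
    """Reads required so that total sequenced bases / genome size >= target.

    Uses integer ceiling division. Assumes all reads map and are evenly
    distributed, which is an idealization.
    """
    total_bases = target_coverage * genome_size
    return (total_bases + read_length - 1) // read_length


# Three positions covered by 20, 22 and 24 reads average to 22X
print(f"{mean_coverage([20, 22, 24]):.0f}X")

# Assumed values: ~3.1 Gb genome, 150 bp reads, 30X target
print(reads_needed(30, 3_100_000_000, 150))
```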

Given the costs associated with whole genome sequencing (WGS) of each individual to 30X or greater coverage, some researchers opt to carry out whole exome sequencing (WES) instead. The human exome is about 1% of the human genome's size, and many researchers are interested in focusing on transcripts to begin with. Therefore, WES can greatly reduce the sequencing costs a researcher might incur if they are willing to forgo the regions not captured in the exome. It should be noted that due to the unevenness of WES coverage, a greater depth of 70-100X is often encouraged.
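
The trade-off between WGS and WES can be sketched with some simple arithmetic. The Python example below compares the total number of bases that must be sequenced for 30X WGS against 100X WES, taking the exome as 1% of the genome. The genome size of about 3.1 billion bases is an assumed round figure, the function name is hypothetical, and the calculation assumes perfect exome capture with no off-target reads, so the real savings will be smaller.

```python
def bases_required(target_coverage: int, region_size: int) -> int:
    """Total bases to sequence to cover a region at the target depth."""
    return target_coverage * region_size


GENOME_SIZE = 3_100_000_000       # assumed approximate human genome size
EXOME_SIZE = GENOME_SIZE // 100   # exome taken as 1% of the genome

wgs_bases = bases_required(30, GENOME_SIZE)   # 30X whole genome
wes_bases = bases_required(100, EXOME_SIZE)   # 100X whole exome

# Even at more than three times the depth, WES needs far fewer bases
print(f"WGS at 30X:  {wgs_bases:,} bases")
print(f"WES at 100X: {wes_bases:,} bases")
print(f"WGS requires {wgs_bases / wes_bases:.0f}x more sequencing")
```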

**Exercise**

Use the figure below to answer the following questions:

<p align="center">
<img src="../img/Difficulty_of_assignment.png" width="600">
</p>

1. If we assume there are no sequencing errors, are you more inclined to speculate that Locus 1 is a germline or somatic variant? Why?
2. Given the existence of sequencing errors, how confident are you that Locus 1 represents a heterozygous locus in the germline?
3. Given the existence of sequencing errors, how confident are you that Locus 1 represents a polymorphic locus in a somatic tissue?
4. How confident are you that Locus 2 is homozygous?
5. What additional information might you want in order to better assess these loci?
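
When thinking about sequencing errors, it can help to ask how likely it is that errors alone would produce the reads you observe. The Python sketch below uses a simple binomial model to compute the probability of seeing at least `k` non-reference reads out of `n` at a position purely by chance. The 1% per-base error rate and the read counts are hypothetical numbers chosen for illustration and are not taken from the figure. The model also assumes errors are independent, which is a simplification.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    Here n is the number of reads at a position, p is the assumed
    per-read error rate, and k is the number of non-reference reads.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ERROR_RATE = 0.01  # hypothetical per-base error rate

# Chance that errors alone give at least 1 or at least 3 odd reads out of 10
print(prob_at_least(1, 10, ERROR_RATE))
print(prob_at_least(3, 10, ERROR_RATE))
```

Under these assumptions, a single discordant read is unremarkable, while several discordant reads at the same position are hard to explain by error alone.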





