
Data generator

Dawid Wysakowicz edited this page Jan 16, 2016 · 6 revisions

To generate samples with random variants based on values from the dbNSFP database, you can use the provided Spark job.

Prerequisites

A table with variants and their frequencies must be present in Hive (we used the dbNSFP database to generate one), with the following columns:

Usage

Usage: spark-submit <spark-options> pl.edu.pw.elka.GenerateDataJob -a [AnnotationTable] -v [VariantTable] 
                    -d <CountryDictionaryPath> -s <SamplesNumber> -o <OutputPath>

        -a, --annotations-table  <arg>   Name of table in hive that contains dbNSFP
                                         annotations. (default = ANNOTATIONS)
        -d, --dict-path  <arg>           Path to dictionary of countries with their
                                         population for generating samples.
        -o, --output-path  <arg>         Where to store the ORC files with generated
                                         variants.
        -s, --samples-number  <arg>      Number of samples to generate.
        -v, --variants-table  <arg>      Name of table in hive that contains variants
                                         which parameters will be used for the generated
                                         ones. (default = VARIANTS)

        --help                           Show help message

Algorithm

Sample origin

First, a country of origin is drawn for each sample. Countries are grouped into regions for which separate allele frequencies are available in the ExAC database: Africa, Americas, Europa, Finnish, SouthAsian, WestAsian. Each region is equally likely to be chosen. Once a region is chosen, a country is drawn from it with probability proportional to its population.
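The two-stage draw described above can be sketched as follows. This is a minimal illustration, not the job's actual code: the `REGIONS` dictionary, its country names and population figures, and the `draw_origin` function are all hypothetical placeholders.

```python
import random

# Hypothetical dictionary: region -> {country: population}.
# Names and numbers are illustrative placeholders, not real data.
REGIONS = {
    "Africa": {"CountryA": 50_000_000, "CountryB": 10_000_000},
    "Europa": {"CountryC": 38_000_000, "CountryD": 80_000_000},
    "Finnish": {"CountryE": 5_500_000},
}


def draw_origin(regions, rng):
    """Return a (region, country) pair for one sample."""
    # Step 1: every region is equally likely.
    region = rng.choice(sorted(regions))
    # Step 2: within the region, weight each country by its population.
    countries = sorted(regions[region])
    weights = [regions[region][c] for c in countries]
    country = rng.choices(countries, weights=weights, k=1)[0]
    return region, country


rng = random.Random(42)  # seeded only to make the sketch reproducible
print(draw_origin(REGIONS, rng))
```

Note that under this scheme a country in a sparsely populated region is chosen more often than its share of the world population would suggest, because regions are drawn uniformly before populations are taken into account.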

Variants

For each variant from the dbNSFP database, a genotype is drawn based on the appropriate allele frequency (af). The probabilities of the different genotypes are as follows:

Genotype   Probability
0/0        1 - 2 * af + af ^ 2
0/1        2 * af * (1 - af)
1/1        af ^ 2
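As a rough illustration of the table, the sketch below computes the three probabilities for a given allele frequency and draws one genotype from them. The function names `genotype_probabilities` and `draw_genotype` are hypothetical and are not taken from the Spark job.

```python
import random


def genotype_probabilities(af):
    """Return the probabilities of 0/0, 0/1 and 1/1 for allele frequency af."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be between 0 and 1")
    return {
        "0/0": (1 - af) ** 2,      # same as 1 - 2 * af + af ^ 2
        "0/1": 2 * af * (1 - af),
        "1/1": af ** 2,
    }


def draw_genotype(af, rng):
    """Draw one genotype according to the probabilities above."""
    probs = genotype_probabilities(af)
    genotypes = list(probs)
    weights = [probs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=1)[0]


rng = random.Random(7)  # seeded only to make the sketch reproducible
print(genotype_probabilities(0.1))
print(draw_genotype(0.1, rng))
```

The three probabilities always sum to 1, since (1 - af)^2 + 2 * af * (1 - af) + af^2 = ((1 - af) + af)^2.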

Then, for each variant that is not homozygous reference (0/0), the allele depth and total depth are drawn so that they have the same mean as in ExAC.
