Data generator
To generate samples with random variants based on dbNSFP database values, you can use the provided Spark job.
A table containing variants and their frequencies must be present in Hive (we used the dbNSFP database to generate one), with the following columns:
Usage: spark-submit <spark-options> pl.edu.pw.elka.GenerateDataJob -a [AnnotationTable] -v [VariantTable]
-d <CountryDictionaryPath> -s <SamplesNumber> -o <OutputPath>
-a, --annotations-table <arg> Name of table in hive that contains dbNSFP
annotations. (default = ANNOTATIONS)
-d, --dict-path <arg> Path to dictionary of countries with their
population for generating samples.
-o, --output-path <arg> Where to store the ocr files with generated
variants.
-s, --samples-number <arg> Number of samples to generate.
-v, --variants-table <arg> Name of table in hive that contains variants
which parameters will be used for the generated
ones. (default = VARIANTS)
--help Show help message
First, a country of origin is drawn for each sample. Countries are grouped into the regions for which the ExAC database provides separate allele frequencies: Africa, Americas, Europa, Finnish, SouthAsian and WestAsian. Each region is chosen with equal probability. Within the chosen region, a country is then drawn with probability proportional to its population.
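The two-stage draw described above can be sketched in a few lines of Python. This is only an illustration, not the code of the Spark job: the `REGIONS` mapping, the placeholder country names and populations, and the `draw_country` helper are all assumptions made for the example, and the real dictionary format read from `--dict-path` is not shown here.

```python
import random

# Hypothetical dictionary: region -> {country: population}.
# Country names and populations are placeholders, not real data.
REGIONS = {
    "Africa": {"CountryA": 30, "CountryB": 10},
    "Europa": {"CountryC": 25, "CountryD": 75},
    "Finnish": {"CountryE": 5},
}


def draw_country(regions, rng):
    """Pick a region uniformly, then a country weighted by population."""
    # Stage 1: every region has the same probability.
    region = rng.choice(sorted(regions))
    # Stage 2: countries inside the region are weighted by population.
    countries = regions[region]
    names = list(countries)
    weights = [countries[name] for name in names]
    country = rng.choices(names, weights=weights, k=1)[0]
    return region, country


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same draws
    for _ in range(3):
        print(draw_country(REGIONS, rng))
```

With this scheme a small region is chosen as often as a large one, so a country's overall probability depends on its share of its own region's population and not on its share of the total population.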
For each variant from the dbNSFP database, a genotype is drawn based on the appropriate allele frequency (af). The probabilities of the individual genotypes are as follows:
Genotype | Probability |
---|---|
0/0 | 1 - 2 * af + af ^ 2 |
0/1 | 2 * af * (1 - af) |
1/1 | af ^ 2 |
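The probabilities in the table are the Hardy-Weinberg proportions for a biallelic site, and 1 - 2 * af + af ^ 2 is the expanded form of (1 - af) ^ 2. A minimal Python sketch of drawing a genotype from them is shown below. The function names `genotype_probabilities` and `draw_genotype` are hypothetical and chosen for this example; the actual Spark job may implement the draw differently.

```python
import random


def genotype_probabilities(af):
    """Return the probabilities of 0/0, 0/1 and 1/1 for allele frequency af."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be between 0 and 1")
    return {
        "0/0": (1 - af) ** 2,      # same as 1 - 2 * af + af ^ 2
        "0/1": 2 * af * (1 - af),
        "1/1": af ** 2,
    }


def draw_genotype(af, rng):
    """Draw one genotype according to the probabilities above."""
    probs = genotype_probabilities(af)
    genotypes = list(probs)
    weights = [probs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed for repeatable draws
    print(genotype_probabilities(0.1))
    print([draw_genotype(0.1, rng) for _ in range(10)])
```

The three probabilities always sum to 1, so they can be passed directly as sampling weights.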
Then, for each variant that is not homozygous for the reference allele, the allele depth and total depth are drawn so that they have the same mean as in ExAC.