Data generator
To generate samples with random variants based on dbNSFP database values, use the provided Spark job.
In the folder ${project_root}/samplesgenerator,
run the command sbt assembly
The resulting jar will then be at ${project_root}/samplesgenerator/target/scala-2.10/gate-generate-samples-assembly-1.0.jar
A table with variants and their frequencies must exist in Hive (we generated ours from the dbNSFP database), with the following columns:
+--------------+------------+
| col_name | data_type |
+--------------+------------+
| reference | string |
| alternative | string |
| hg19_chr | string |
| hg19_pos | int |
| exac_ac | string |
| exac_af | double |
| exac_adj_ac | string |
| exac_adj_af | double |
| exac_afr_ac | string |
| exac_afr_af | double |
| exac_amr_ac | string |
| exac_amr_af | double |
| exac_eas_ac | string |
| exac_eas_af | double |
| exac_fin_ac | string |
| exac_fin_af | double |
| exac_nfe_ac | string |
| exac_nfe_af | double |
| exac_sas_ac | string |
| exac_sas_af | double |
| mean | double |
+--------------+------------+
You must also provide a file listing countries and their populations, from which each sample's country is chosen. Example format:
4,AF,Afghanistan,Asia,31627506
8,AL,Albania,Europe,2894475
10,AQ,Antarctica,other,0
12,DZ,Algeria,Africa,38934334
16,AS,American Samoa,Oceania,55434
20,AD,Andorra,Europe,72786
24,AO,Angola,Africa,24227524
28,AG,Antigua and Barbuda,Americas,90900
31,AZ,Azerbaijan,Asia,9537823
32,AR,Argentina,Americas,42980026
36,AU,Australia,Oceania,23490736
40,AT,Austria,Europe,8534492
The columns are: country id, country code, country name, region name, population.
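As an illustration of the dictionary format, here is a minimal Python sketch that parses such a file into records. It is not the code used by the Spark job; the `Country` type and the `parse_countries` function are hypothetical names introduced only for this example.

```python
import csv
import io
from typing import List, NamedTuple


class Country(NamedTuple):
    # Hypothetical record type mirroring the five columns described above.
    country_id: int
    code: str
    name: str
    region: str
    population: int


def parse_countries(text: str) -> List[Country]:
    """Parse lines of the form: id,code,name,region,population."""
    rows = csv.reader(io.StringIO(text))
    return [
        Country(int(r[0]), r[1], r[2], r[3], int(r[4]))
        for r in rows
        if r  # skip blank lines
    ]


sample = """4,AF,Afghanistan,Asia,31627506
8,AL,Albania,Europe,2894475
10,AQ,Antarctica,other,0
"""
countries = parse_countries(sample)
for c in countries:
    print(c.code, c.region, c.population)
```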
Usage: spark-submit <spark-options> pl.edu.pw.elka.GenerateDataJob -a [AnnotationTable]
-d <CountryDictionaryPath> -s <SamplesNumber> -o <OutputPath>
-a, --annotations-table <arg> Name of table in hive that contains dbNSFP
annotations. (default = ANNOTATIONS)
-d, --dict-path <arg> Path to dictionary of countries with their
population for generating samples.
-o, --output-path <arg> Where to store the ocr files with generated
variants.
-s, --samples-number <arg> Number of samples to generate.
--help Show help message
First, a country of origin is drawn for each sample. Countries are grouped into the regions for which the ExAC database provides region-specific allele frequencies: Africa, Americas, Europa, Finnish, SouthAsian, WestAsian. Each region is equally likely to be chosen. Once a region is chosen, a country within it is drawn, weighted by country population.
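The two-stage draw described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the job's implementation: `draw_country` is a hypothetical helper, countries are grouped by whatever region label they carry (how the job maps dictionary regions onto ExAC populations is not described here), and entries with zero population are dropped because they could never be drawn.

```python
import random
from collections import defaultdict


def draw_country(countries, rng):
    """Pick a region uniformly, then a country weighted by population.

    countries: iterable of (name, region, population) tuples.
    rng: a random.Random instance.
    """
    by_region = defaultdict(list)
    for name, region, population in countries:
        if population > 0:  # assumption: zero-population entries are skipped
            by_region[region].append((name, population))
    # Stage 1: every region has the same probability.
    region = rng.choice(sorted(by_region))
    # Stage 2: within the region, weight each country by its population.
    names, weights = zip(*by_region[region])
    return region, rng.choices(names, weights=weights, k=1)[0]


example = [
    ("Albania", "Europe", 2894475),
    ("Austria", "Europe", 8534492),
    ("Angola", "Africa", 24227524),
    ("Antarctica", "other", 0),
]
rng = random.Random(0)
print(draw_country(example, rng))
```

With this scheme a small region is sampled as often as a large one, so the region frequencies in the output do not follow world population; only the choice of country inside a region does.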
For each variant in the dbNSFP database, a genotype is drawn based on the appropriate allele frequency (af). The probabilities of the genotypes are:
Genotype | Probability |
---|---|
0/0 | 1 - 2 * af + af ^ 2 |
0/1 | 2 * af * (1 - af) |
1/1 | af ^ 2 |
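The probabilities in the table are the Hardy-Weinberg proportions for an allele frequency af, and they sum to 1. A minimal Python sketch of drawing a genotype from them is shown below; `genotype_probabilities` and `draw_genotype` are hypothetical names used only for this example and are not taken from the Spark job.

```python
import random


def genotype_probabilities(af):
    """Return the probabilities of 0/0, 0/1 and 1/1 for allele frequency af."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be in [0, 1]")
    return {
        "0/0": (1 - af) ** 2,  # same as 1 - 2 * af + af ^ 2
        "0/1": 2 * af * (1 - af),
        "1/1": af ** 2,
    }


def draw_genotype(af, rng):
    """Draw one genotype according to the probabilities above."""
    probs = genotype_probabilities(af)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


rng = random.Random(42)
print(genotype_probabilities(0.1))
print(draw_genotype(0.1, rng))
```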
Then, for each variant that is not homozygous reference, the allele depth and total depth are drawn so that they have the same mean as in ExAC.