Scripts that will take export files of DNA variant data from the Dutch genome diagnostic laboratories (VKGL), group the data, normalize the data, and import it into an LOVD3 or LOVD+ instance, possibly updating previous records.

LOVDnl/VKGL_import

VKGL import

Every three months, the Dutch genome diagnostic laboratories release their new data. This script takes their export files, normalizes and maps the variants, regroups them, and imports everything into an LOVD3 instance. If previous records are found, they are updated. Records in LOVD3 that are no longer found in the new data are marked as removed only if the user requests it.

Grouping the centers' raw data files

We used to work with the grouped file generated by the Groningen center. However, that file leaves out data that we'd like to keep, so we now use the centers' raw data files. The format_raw_VKGL_files.php script takes each center's raw data file and creates one grouped file in the format that Groningen used to provide.

./format_raw_VKGL_files.php file_center_A.txt [file_center_B.txt [ ... ]] [-y]

If the formatter does not receive a data file for a center listed in the settings.json file, it throws an error. The formatter currently recognizes four file formats: two formats from Agilent Alissa exports, the LUMC file format, and the file format used by Radboud and MUMC+. The output follows the format of the Groningen file that we used to receive, using VCF fields. Variants with identical descriptions are grouped, but this script does not try to normalize variants, as VCF fields are not suitable for that. The resulting output is therefore not fully grouped. LUMC variants, which are reported using HGVS, are translated to a fake VCF-like format. Conversion to HGVS and full normalization are handled by the processing script.
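The grouping step can be pictured as collecting rows that share exactly the same VCF fields. The sketch below is an illustration in Python, not the script's own PHP code. The column names (`chromosome`, `position`, `ref`, `alt`, `center`, `classification`) and the sample rows are assumptions invented for the example, not the actual file layout.

```python
from collections import defaultdict


def group_by_vcf_fields(rows):
    """Group rows with identical VCF fields; no normalization is attempted."""
    grouped = defaultdict(dict)
    for row in rows:
        # Hypothetical column names; the real files may use different ones.
        key = (row["chromosome"], row["position"], row["ref"], row["alt"])
        grouped[key][row["center"]] = row["classification"]
    return dict(grouped)


# Invented sample rows, for illustration only.
rows = [
    {"chromosome": "1", "position": 100387136, "ref": "G", "alt": "GA",
     "center": "center_A", "classification": "VUS"},
    {"chromosome": "1", "position": 100387136, "ref": "G", "alt": "GA",
     "center": "center_B", "classification": "likely benign"},
    # The same insertion anchored one base earlier stays in its own group,
    # because the keys are compared literally.
    {"chromosome": "1", "position": 100387135, "ref": "TG", "alt": "TGA",
     "center": "center_C", "classification": "VUS"},
]
groups = group_by_vcf_fields(rows)
```

The third row shows why the output is not fully grouped: without normalization, two spellings of one variant end up under different keys.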

Processing the data

process_VKGL_data.php file_to_import.tsv [-y]

Note that the script is interactive. The first time it runs, it asks for all the information it needs to process the data. Settings are stored in a settings.json file. The -y flag can be used only when settings have been provided before; passing it accepts all previous settings.

Output

The script will write output to the terminal, informing you of its progress. Note that the script will only write a new line in case of an error, or if the progress has increased by at least 0.1%. Running large files with 100,000 variants or more and an empty cache may cause the script to create no new output for a few minutes.
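One way to get this kind of throttled progress output is to count progress in whole tenths of a percent and print only when that count goes up. The following Python sketch illustrates the idea; it is not taken from the script, and the function name and message format are assumptions. Integer arithmetic is used so that floating-point rounding cannot suppress a line.

```python
def progress_lines(total):
    """Yield a message each time progress reaches a new tenth of a percent."""
    last_tenths = 0
    for done in range(1, total + 1):
        # Progress in units of 0.1%, computed with integers only.
        tenths = done * 1000 // total
        if tenths > last_tenths:
            last_tenths = tenths
            yield f"{tenths / 10:.1f}%"


messages = list(progress_lines(100000))
small_run = list(progress_lines(50))
```

With 100,000 variants this yields one line per 100 variants, so a slow stretch of 100 uncached variants produces no output until it completes.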

Caches

This script caches data from Mutalyzer in an NC_cache.txt file and a mapping_cache.txt file.

NC cache

This file contains variant descriptions on the genome (NC reference sequences) and their normalized counterparts. The file does not need to be sorted. An example line looks like:

NC_000001.10:g.100387136_100387137insA  NC_000001.10:g.100387137dup

Note that both values may be the same when the variant cannot be normalized. The script builds this cache if you do not have it, but since building the cache may take a long time, it is recommended to use the NC cache from the caches project.

The script will store errors using JSON, like so:

NC_000001.10:g.150771703C>T     {"EREF":"C not found at position 150771703, found T instead."}
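
A reader of this cache has to tell normalized descriptions apart from error entries. The Python sketch below shows one way to do that. It assumes the two columns are separated by a tab and that an error entry is recognizable by a leading `{`. The function name is hypothetical and the code is not part of the project.

```python
import json


def parse_nc_cache_line(line):
    """Split one cache line into (variant, normalized, errors)."""
    # Assumption: the two columns are tab-separated.
    variant, value = line.rstrip("\n").split("\t", 1)
    value = value.strip()
    if value.startswith("{"):
        # Error entries hold a JSON object instead of a description.
        return variant, None, json.loads(value)
    return variant, value, None


ok = parse_nc_cache_line(
    "NC_000001.10:g.100387136_100387137insA\tNC_000001.10:g.100387137dup\n"
)
failed = parse_nc_cache_line(
    'NC_000001.10:g.150771703C>T\t'
    '{"EREF":"C not found at position 150771703, found T instead."}\n'
)
```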

Mapping cache

The mapping cache contains mapping data from two Mutalyzer webservices, both the runMutalyzerLight and the numberConversion methods. Because both methods provide partially overlapping data, the results are stored together. The cache stores the method(s) used; if the runMutalyzerLight webservice didn't provide enough transcripts, the numberConversion service can be used and the additional data is added to the cache in a new line.

The file does not need to be sorted, but sorting may help in finding duplicate variants. Example lines look like:

NC_000001.10:g.100154502A>G     {"NM_017734.4":{"c":"c.686A>G","p":"p.(Asn229Ser)"},"methods":["runMutalyzerLight"]}
NC_000001.10:g.13413980G>A      {"NM_001291381.1":{"c":"c.923G>A","p":"p.?"},"methods":["runMutalyzerLight","numberConversion"]}
NC_000001.10:g.13634793G>T      {"methods":["runMutalyzerLight","numberConversion"]}

The third line in this example shows a variant where no mapping data could be found, using either Mutalyzer method.
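
Because additional data for a variant can be added on a new line, a reader of this cache may need to combine several lines per variant. The Python sketch below shows one possible way. It assumes tab-separated columns and that later lines supplement earlier ones; neither is confirmed by the project. The function name is hypothetical, and the second sample line is invented to demonstrate the merge.

```python
import json


def load_mapping_cache(lines):
    """Merge all cache lines into one entry per variant."""
    cache = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: variant and JSON payload are tab-separated.
        variant, payload = line.split("\t", 1)
        entry = cache.setdefault(variant, {"methods": []})
        for key, value in json.loads(payload).items():
            if key == "methods":
                # Keep each method once, in order of first appearance.
                for method in value:
                    if method not in entry["methods"]:
                        entry["methods"].append(method)
            else:
                # Transcript data; a later line overrides an earlier one.
                entry[key] = value
    return cache


sample = [
    'NC_000001.10:g.100154502A>G\t{"NM_017734.4":{"c":"c.686A>G","p":"p.(Asn229Ser)"},"methods":["runMutalyzerLight"]}',
    # Invented follow-up line for the same variant, for illustration only.
    'NC_000001.10:g.100154502A>G\t{"methods":["numberConversion"]}',
    'NC_000001.10:g.13634793G>T\t{"methods":["runMutalyzerLight","numberConversion"]}',
]
cache = load_mapping_cache(sample)
```

An entry that ends up with only a `methods` key, like the last one here, corresponds to a variant for which no mapping data was found.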
