CMAT is a software toolkit and curation protocol for parsing and enriching ClinVar's XML data. To learn more about what is available in ClinVar, please refer to their website.
For instructions on how to process ClinVar data for the Open Targets platform, see here.
The code requires Python 3.8+, and you will also need Nextflow 21.10+ to run the pipelines. Refer to Nextflow documentation for specifics on installing Nextflow on your system.
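As a quick sanity check of the requirements above, the following minimal sketch tests the running Python version and looks for a `nextflow` executable on the PATH. The helper names (`version_at_least`, `check_prerequisites`) are hypothetical and not part of CMAT. The Nextflow version string is passed in by hand, because the sketch makes no assumption about the format of Nextflow's own version output.

```python
import re
import shutil
import sys

MIN_PYTHON = (3, 8)      # minimum Python version stated in this readme
MIN_NEXTFLOW = (21, 10)  # minimum Nextflow version stated in this readme


def version_at_least(version_text, minimum):
    """Compare the leading numeric components of a version string to a minimum tuple."""
    numbers = re.findall(r"\d+", version_text)[:len(minimum)]
    return tuple(int(n) for n in numbers) >= tuple(minimum)


def check_prerequisites():
    """Report whether Python is new enough and whether `nextflow` is on the PATH."""
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "nextflow_path": shutil.which("nextflow"),  # None if not found
    }


if __name__ == "__main__":
    print(check_prerequisites())
    # The version string below is a made-up example, not real tool output.
    print(version_at_least("21.10.6", MIN_NEXTFLOW))
```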
To install CMAT, first either clone the repository or download the latest released version from here:
git clone git@github.com:EBIvariation/CMAT.git
# OR
wget -O CMAT.zip https://github.com/EBIvariation/CMAT/archive/refs/tags/v3.0.3.zip
unzip CMAT.zip
Then install the library and its dependencies as follows (e.g. in a virtual environment):
cd CMAT
pip install -r requirements.txt
python setup.py install
You then need to set the PYTHON_BIN variable in the Nextflow config, which will allow the Nextflow processes to access the correct Python executable.
Finally, the instructions in this readme use the following environment variables as a convenience; they are not needed for the pipelines to run.
# Path to directory where source code is downloaded
export CODE_ROOT=
# Path to ontology mapping file (the provided path points to the version included in this repo)
export LATEST_MAPPINGS=${CODE_ROOT}/mappings/latest_mappings.tsv
If this is your first time running the pipelines with a specific target ontology (i.e. you don't have a latest mappings file to use),
you can use an empty TSV file containing just the header #ontology=<code>, where <code> is taken from this list of supportable ontologies.
This file will be filled with automated and manually curated mappings as processing continues.
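A header-only mappings file like the one described above could be created along these lines. This is a minimal sketch: the function names are hypothetical, and the ontology code `EFO` is only an illustrative value, so check it against the list of supportable ontologies before using it.

```python
import os
import tempfile
from pathlib import Path

HEADER_PREFIX = "#ontology="


def create_empty_mappings(path, ontology_code):
    """Write a mappings file that contains only the '#ontology=<code>' header line."""
    header = f"{HEADER_PREFIX}{ontology_code}\n"
    Path(path).write_text(header, encoding="utf-8")
    return header


def read_ontology_code(path):
    """Return the ontology code from the first line of a mappings file."""
    with open(path, encoding="utf-8") as handle:
        first_line = handle.readline().strip()
    if not first_line.startswith(HEADER_PREFIX):
        raise ValueError(f"Expected a '{HEADER_PREFIX}<code>' header, got: {first_line!r}")
    return first_line[len(HEADER_PREFIX):]


if __name__ == "__main__":
    # Demonstrate in a temporary directory; 'EFO' is an assumed example code.
    with tempfile.TemporaryDirectory() as tmp:
        mappings_path = os.path.join(tmp, "latest_mappings.tsv")
        create_empty_mappings(mappings_path, "EFO")
        print(read_ontology_code(mappings_path))
```

You would then point LATEST_MAPPINGS at the file created this way.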
To confirm everything is set up properly, you can run the annotation pipeline on the small dataset included with the tests.
It should take a couple of minutes to run and generate a file annotated_clinvar.xml.gz in the test directory.
mkdir testdir && cd testdir
nextflow run ${CODE_ROOT}/pipelines/annotation_pipeline.nf \
--output_dir . \
--clinvar ${CODE_ROOT}/tests/output_generation/resources/end2end/input.xml.gz \
--mappings ${LATEST_MAPPINGS}
You can also install CMAT using Conda.
For example, the following installs CMAT in a new environment called cmat, activates the environment, and prints usage:
conda create -n cmat -c conda-forge -c bioconda cmat
conda activate cmat
cmat
Note that with a Conda installation you can't invoke the pipelines directly via Nextflow, so you will need to use the corresponding cmat commands, e.g. cmat annotate instead of nextflow run annotation_pipeline.nf.
All the same command line options apply.
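To illustrate the correspondence, here is a small sketch that rewrites a `nextflow run` argument list into the equivalent `cmat` invocation. The function `to_cmat_command` is hypothetical. The three pipeline-to-subcommand pairs are the ones named in this readme, and the sketch takes "all the same command line options apply" at face value by passing every remaining argument through unchanged.

```python
from pathlib import PurePosixPath

# Pipeline script name -> cmat subcommand, as listed in this readme.
PIPELINE_TO_SUBCOMMAND = {
    "annotation_pipeline.nf": "annotate",
    "generate_curation_spreadsheet.nf": "generate-curation",
    "export_curation_spreadsheet.nf": "export-curation",
}


def to_cmat_command(nextflow_args):
    """Translate ['nextflow', 'run', '<script>.nf', ...] into ['cmat', '<subcommand>', ...]."""
    args = list(nextflow_args)
    if len(args) < 3 or args[:2] != ["nextflow", "run"]:
        raise ValueError("Expected arguments starting with: nextflow run <pipeline>.nf")
    script_name = PurePosixPath(args[2]).name
    if script_name not in PIPELINE_TO_SUBCOMMAND:
        raise ValueError(f"No known cmat subcommand for pipeline: {script_name}")
    # Remaining options are passed through unchanged.
    return ["cmat", PIPELINE_TO_SUBCOMMAND[script_name], *args[3:]]


if __name__ == "__main__":
    example = ["nextflow", "run", "pipelines/annotation_pipeline.nf", "--output_dir", "."]
    print(" ".join(to_cmat_command(example)))
```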
CMAT includes a main annotation pipeline (which also performs consequence and gene mapping), as well as two pipelines to help manage trait mapping curation. It can also be used as a standard Python library.
The annotation pipeline annotates variants with genes and functional consequences, and annotates traits with ontology terms using an existing mappings file. It outputs the results as an annotated XML file.
# Directory to run annotation pipeline
export ANNOTATION_ROOT=
# Create directories for data processing
mkdir -p ${ANNOTATION_ROOT}
cd ${ANNOTATION_ROOT}
mkdir -p gene_mapping logs
# Run the nextflow pipeline, resuming execution of previous attempt if possible.
# For conda, use instead: cmat annotate
nextflow run ${CODE_ROOT}/pipelines/annotation_pipeline.nf \
--output_dir ${ANNOTATION_ROOT} \
--mappings ${LATEST_MAPPINGS} \
-resume
You can use the --include_transcripts flag to also include transcript annotations with the functional consequences.
By default, the pipeline will download and annotate the latest ClinVar RCV XML dump from FTP. If you want to run it on an existing XML file, you can pass it via the --clinvar flag.
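If you launch the annotation pipeline from a script, the optional flags can be assembled programmatically. The sketch below only builds the argument list. The function `build_annotation_command` and its parameter names are hypothetical, while the options themselves (--output_dir, --mappings, --clinvar, --include_transcripts, -resume) are the ones described in this section.

```python
import shlex


def build_annotation_command(code_root, output_dir, mappings,
                             clinvar=None, include_transcripts=False, resume=True):
    """Build the argument list for running the annotation pipeline with Nextflow."""
    command = [
        "nextflow", "run", f"{code_root}/pipelines/annotation_pipeline.nf",
        "--output_dir", str(output_dir),
        "--mappings", str(mappings),
    ]
    if clinvar is not None:
        # Use an existing XML file instead of downloading the latest dump.
        command += ["--clinvar", str(clinvar)]
    if include_transcripts:
        command.append("--include_transcripts")
    if resume:
        command.append("-resume")
    return command


if __name__ == "__main__":
    # All paths below are placeholders.
    command = build_annotation_command(
        "/path/to/CMAT", "/path/to/annotation", "/path/to/latest_mappings.tsv",
        clinvar="/path/to/clinvar.xml.gz", include_transcripts=True,
    )
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the run.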
The trait curation steps update the trait mappings used by the annotation pipeline and should be performed regularly to ensure new ClinVar data is mapped appropriately.
A complete protocol for trait curation can be found here, though it may require adaptation for your use case. A minimum set of steps to run the curation is provided in the sections below.
# Directory to run trait curation pipelines
export CURATION_ROOT=
# Path to previous curator comments to be included in spreadsheet.
# If this is the first round of curation, you can use an empty file.
export CURATOR_COMMENTS=
# Create directories for data processing
mkdir -p ${CURATION_ROOT}
cd ${CURATION_ROOT}
# Run the nextflow pipeline, resuming execution of previous attempt if possible.
# For conda, use instead: cmat generate-curation
nextflow run ${CODE_ROOT}/pipelines/generate_curation_spreadsheet.nf \
--curation_root ${CURATION_ROOT} \
--mappings ${LATEST_MAPPINGS} \
--comments ${CURATOR_COMMENTS} \
-resume
By default, the pipeline will download and map the latest ClinVar RCV XML dump from FTP. If you want to run it on an existing XML file, you can pass it via the --clinvar flag.
To create the curation spreadsheet, first make your own copy of the template.
Then paste the contents of ${CURATION_ROOT}/google_sheets_table.tsv into it, starting with column H “ClinVar label”.
Manual curation is done in the spreadsheet, ideally by a curator and at least one reviewer. The written protocol can be found here.
Once the manual curation is completed, the new mappings need to be incorporated into the set of latest mappings to be used for future annotation and trait curation.
Download the spreadsheet as a CSV file, making sure that all the data is visible before doing so (i.e., no filters are applied). Save the data to a file ${CURATION_ROOT}/finished_curation_spreadsheet.csv.
cd ${CURATION_ROOT}
# Run the nextflow pipeline, resuming execution of previous attempt if possible.
# For conda, use instead: cmat export-curation
nextflow run ${CODE_ROOT}/pipelines/export_curation_spreadsheet.nf \
--input_csv ${CURATION_ROOT}/finished_curation_spreadsheet.csv \
--curation_root ${CURATION_ROOT} \
--mappings ${LATEST_MAPPINGS} \
-resume
CMAT can also be used as a normal Python library, for example:
from cmat.clinvar_xml_io import ClinVarDataset

for record in ClinVarDataset('/path/to/clinvar.xml.gz'):
    s = f'{record.accession}: '
    if record.measure and record.measure.has_complete_coordinates:
        s += record.measure.vcf_full_coords
    s += ' => '
    s += ', '.join(trait.preferred_or_other_valid_name for trait in record.traits_with_valid_names)
    # e.g. RCV001842692: 3_38633214_G_C => Cardiac arrhythmia
    print(s)
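Building on the same record attributes, a small summary could be computed as sketched below. The function `summarise_coordinates` is hypothetical and relies only on `record.measure` and `has_complete_coordinates` as used in the example above. The demo uses stand-in objects rather than real ClinVar records so that it runs without CMAT installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarise_coordinates(records):
    """Count records whose measure has complete coordinates versus the rest."""
    counts = Counter()
    for record in records:
        measure = record.measure
        if measure and measure.has_complete_coordinates:
            counts["complete"] += 1
        else:
            counts["incomplete_or_missing"] += 1
    return counts


if __name__ == "__main__":
    # Stand-in records for illustration only; these are not real ClinVar data.
    fake_records = [
        SimpleNamespace(measure=SimpleNamespace(has_complete_coordinates=True)),
        SimpleNamespace(measure=SimpleNamespace(has_complete_coordinates=False)),
        SimpleNamespace(measure=None),
    ]
    print(dict(summarise_coordinates(fake_records)))
```

With CMAT installed, the same function should accept `ClinVarDataset('/path/to/clinvar.xml.gz')` in place of the stand-in list, since it only iterates over the records.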
If you find CMAT useful, you can cite the following:
Shen et al., CMAT: ClinVar Mapping and Annotation Toolkit. Bioinformatics Advances, 2024. doi:10.1093/bioadv/vbae018