tcga-pipeline

Downloads TCGA data from the Broad Institute's GDAC Firehose pipeline and stores it in convenient HDF files.

We've already done the hard work! Feel free to download the HDF files directly from Zenodo:

tcga_omic.tar.gz: multi-omic data (1.6GB)
tcga_clinical.tar.gz: clinical annotations (6.2MB)

The Cancer Genome Atlas (TCGA)

TCGA includes multi-omic and clinical data, collected from 11,000+ patients divided into 38 cancer types.
It is a really compelling dataset for supervised or unsupervised machine learning tasks.
You can access TCGA data via FireBrowse or the firehose_get command line tool.
Unfortunately, these tools (FireBrowse/Firehose) are inconvenient for the uninitiated:
- Unless you have a lot of bioinformatics background knowledge, it can be difficult to understand which data you should actually download.
- Once you've downloaded the data, it exists as several GB of zipped text files with long, complicated names.

This repository contains a complete workflow for (i) downloading useful kinds of TCGA data and (ii) storing it in a sensible format -- an HDF file.

Setup and execution

If you really want to run the code in this repo for yourself, then take the following steps:

Clone the repository: git clone git@github.com:dpmerrell/tcga-pipeline.
Set up your python environment. Install the dependencies: pip install -r requirements.txt. (I recommend doing this in a virtual environment.)
Make sure you have plenty of disk space. The downloaded, unzipped, and partially processed data will occupy 280GB on disk. (In contrast, the final HDF files take <4GB unzipped.)
Make sure you have plenty of time. The downloads take a while -- consider running the workflow overnight.
If you're feeling brave, adjust the parameters in the config.yaml file.
Run the Snakemake workflow: snakemake --cores 1. Using more cores doesn't necessarily buy you any speed for downloading data.
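
Before starting a long download, it can help to confirm that the target disk really has room. The following is a minimal sketch, not part of this repository: the function name has_enough_space and the constant REQUIRED_BYTES are made up for illustration, and the threshold simply reuses the 280GB figure quoted in the steps above.

```python
import shutil

# Assumption: the 280GB footprint quoted in the steps above, in decimal bytes.
REQUIRED_BYTES = 280 * 10**9


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 10**9
    print(f"Free space in current directory: {free_gb:.1f} GB")
    if not has_enough_space("."):
        print("Warning: less than 280GB free; the workflow may run out of disk space.")
```

Run it from the directory where you cloned the repository, since that is where the check looks for free space.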

Structure of the data

Multi-omic data

This workflow produces an HDF file of multi-omic data, with the structure illustrated in the figure (tcga_data.png).

/omic_data. HDF5 group containing multi-omic data, along with row and column information.
/barcodes. HDF5 group containing full TCGA barcodes for data samples, i.e., the barcode for each sample and each omic type (when it exists). Full barcodes are potentially useful for modeling batch effects.
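
To show why full barcodes help with batch effects, here is a rough sketch that splits a barcode into its standard TCGA fields (project, tissue source site, participant, sample type and vial, portion and analyte, plate, center). The plate and center fields are the usual batch covariates. The names Barcode and parse_barcode are hypothetical and do not come from this repository. The sketch also assumes barcodes are plain strings and that shorter barcodes simply lack the trailing fields.

```python
from typing import NamedTuple, Optional


class Barcode(NamedTuple):
    project: str
    tss: str                    # tissue source site
    participant: str
    sample_type: Optional[str]  # e.g. "01" for a primary solid tumor
    vial: Optional[str]
    portion: Optional[str]
    analyte: Optional[str]
    plate: Optional[str]
    center: Optional[str]


def parse_barcode(barcode: str) -> Barcode:
    """Split a TCGA barcode into fields; fields absent from short barcodes are None."""
    parts = barcode.strip().split("-")
    if len(parts) < 3:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    # Pad so that short (e.g. patient-level) barcodes still unpack cleanly.
    padded = parts[:7] + [None] * (7 - len(parts))
    project, tss, participant, sample, portion, plate, center = padded
    sample_type = sample[:2] if sample else None
    vial = (sample[2:] or None) if sample else None
    portion_id = portion[:2] if portion else None
    analyte = (portion[2:] or None) if portion else None
    return Barcode(project, tss, participant, sample_type, vial,
                   portion_id, analyte, plate, center)


if __name__ == "__main__":
    print(parse_barcode("TCGA-02-0001-01C-01D-0182-01"))
```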

Omic data feature names take the following form:

GENE_DATATYPE

Possible values for DATATYPE include mutation, cnv, methylation, mrnaseq, and rppa.
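
As an illustration of the naming scheme, the sketch below splits a feature name into its gene and data type. It splits on the last underscore, on the assumption that a gene identifier could itself contain underscores. The helper name split_feature_name and the example feature names are made up for illustration.

```python
# Data types listed above.
DATATYPES = ("mutation", "cnv", "methylation", "mrnaseq", "rppa")


def split_feature_name(name):
    """Split 'GENE_DATATYPE' into (gene, datatype), using the last underscore."""
    gene, sep, datatype = name.rpartition("_")
    if not sep or not gene or datatype not in DATATYPES:
        raise ValueError(f"unexpected feature name: {name!r}")
    return gene, datatype


if __name__ == "__main__":
    # Hypothetical feature names, grouped by data type.
    by_type = {}
    for feature in ["TP53_mutation", "TP53_cnv", "EGFR_mrnaseq"]:
        gene, datatype = split_feature_name(feature)
        by_type.setdefault(datatype, []).append(gene)
    print(by_type)
```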

There are many missing values, indicated by NaNs -- not all measurements were taken for all patients.
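
Because missingness varies by measurement, it is worth checking how sparse each feature is before modeling. The following sketch uses only the standard library and a tiny made-up table; on the real arrays you would more likely use NumPy (for example numpy.isnan). The function name missing_fraction is hypothetical.

```python
import math


def missing_fraction(rows):
    """Return the fraction of NaN entries in each column of a row-major table."""
    n_rows = len(rows)
    if n_rows == 0:
        return []
    counts = [0] * len(rows[0])
    for row in rows:
        for j, value in enumerate(row):
            if math.isnan(value):
                counts[j] += 1
    return [count / n_rows for count in counts]


if __name__ == "__main__":
    nan = float("nan")
    # Made-up data: 4 samples (rows) by 2 features (columns).
    toy = [[1.0, nan], [nan, nan], [3.0, 2.0], [4.0, nan]]
    print(missing_fraction(toy))
```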

Clinical data

This workflow also produces an HDF file of clinical data. Its structure is also shown in the figure.

Some provenance details

We download data from particular points in the Broad Institute GDAC Firehose pipeline.

Copy number variation
- CopyNumber_Gistic2 node in this DAG
Somatic Mutation annotations
- Mutation_Packager_Oncotated_Calls node in this DAG
Methylation
- Methylation_Preprocess node in this DAG
- We recover the barcodes for methylation samples by inspecting the corresponding entries in "humanmethylation450" files whenever they exist, and fall back to "humanmethylation27" files otherwise.
Gene expression
- mRNAseq_Preprocess node in this DAG
- We recover the barcodes for RNAseq samples by inspecting the corresponding entries in "illuminahiseq" files whenever they exist, and fall back to "illuminaga" files otherwise.
Reverse Phase Protein Array
- RPPA_AnnotateWithGene node in this DAG
Clinical Data
- Clinical_Pick_Tier1 node in this DAG
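
The platform preference described for methylation and gene expression amounts to a simple fallback rule. The sketch below shows one way to express it. It is not the repository's actual code: pick_platform_file is a hypothetical helper, the file names are invented, and matching by substring is an assumption.

```python
def pick_platform_file(filenames, preferred, fallback):
    """Return the first file naming the preferred platform, else the fallback, else None."""
    for platform in (preferred, fallback):
        for filename in filenames:
            if platform in filename.lower():
                return filename
    return None


if __name__ == "__main__":
    # Invented file names, for illustration only.
    rnaseq_files = ["cohort.illuminaga_rnaseqv2.txt", "cohort.illuminahiseq_rnaseqv2.txt"]
    print(pick_platform_file(rnaseq_files, "illuminahiseq", "illuminaga"))

    methylation_files = ["cohort.humanmethylation27.txt"]
    print(pick_platform_file(methylation_files, "humanmethylation450", "humanmethylation27"))
```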

Licensing/Legal stuff

(c) David Merrell 2021

The software in this repository is distributed under an MIT license. See LICENSE.txt for details.

Note: downloading data from the Broad TCGA GDAC site constitutes agreement to the TCGA Data Usage Policy:

https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy

See also this note from the NIH: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga

"Guidelines for citing TCGA in your research", on data usage in publications: "...all TCGA data are available without restrictions on their use in publications or presentations."
