This repository accompanies the work "Cell states and neighborhoods in distinct clinical stages of primary and metastatic esophageal adenocarcinoma" (ref). It contains all the code necessary to reproduce the analyses. Each subsection contains a README that describes the path placeholders.
To reproduce the analyses, first download the data as described in the sections "Links to data used in the study" and "Links to other data to download"; instructions and descriptions of the files are provided in these sections.
Additionally, one needs an environment with all the required packages correctly installed. The environment used to run the analyses is provided as a YAML file, eac_env.yml. For the spatial analysis specifically, a separate environment for running Cell2Location is provided as cell2loc_env.yml (the interplay between package versions can be tricky to get right).
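As a minimal sketch of the environment setup, the snippet below builds the `conda env create` commands for the two YAML files named above. It assumes that conda is the environment manager (the README only says the environments are provided as YAML files) and that both files sit in the current working directory; the helper names `conda_create_command` and `create_env` are hypothetical and not part of this repository. By default it only prints the commands.

```python
import shutil
import subprocess

# Environment files named in this README; their location is an assumption.
ENV_FILES = {"main": "eac_env.yml", "spatial": "cell2loc_env.yml"}


def conda_create_command(env_file):
    """Return the conda command that creates an environment from a YAML file."""
    return ["conda", "env", "create", "-f", env_file]


def create_env(env_file, dry_run=True):
    """Print the command (dry run) or actually run it with conda."""
    cmd = conda_create_command(env_file)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    if shutil.which("conda") is None:
        raise RuntimeError("conda was not found on PATH")
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `create_env(ENV_FILES["main"])` prints the command for the main environment; pass `dry_run=False` to execute it.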
Then, one needs to run the scripts in order, as some intermediate files generated by the scripts will be re-used in subsequent scripts.
The order to run scripts is depicted in the following illustration:
Instructions to replace placeholders are given in the README of each folder. Information on where to download all the data needed to reproduce the analysis is found in the following sections.
If there are any questions about the code or issues reproducing the analysis, please contact [email protected].
In the original paper, patients are referred to as P1 through P10 for simplicity. In the scripts and notebooks, patients are referred to by their sample ID. The mapping is provided below.
Patient ID | Sample ID |
---|---|
P1 | CCG1153_4496262 |
P2 | CCG1153_6640539 |
P3 | CCG1153_4411 |
P4 | Aguirre_EGSFR0074 |
P5 | Aguirre_EGSFR0148 |
P6 | Aguirre_EGSFR1732 |
P7 | Aguirre_EGSFR0128 |
P8 | Aguirre_EGSFR1938 |
P9 | Aguirre_EGSFR1982 |
P10 | Aguirre_EGSFR2218 |
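For convenience, the mapping in the table above can be written as a Python dictionary, for example to relabel figures with the patient IDs used in the paper. This is an illustrative sketch: the names `SAMPLE_TO_PATIENT`, `PATIENT_TO_SAMPLE` and `to_patient_label` are hypothetical and do not come from the repository's scripts.

```python
# Sample ID -> patient ID, copied from the table above.
SAMPLE_TO_PATIENT = {
    "CCG1153_4496262": "P1",
    "CCG1153_6640539": "P2",
    "CCG1153_4411": "P3",
    "Aguirre_EGSFR0074": "P4",
    "Aguirre_EGSFR0148": "P5",
    "Aguirre_EGSFR1732": "P6",
    "Aguirre_EGSFR0128": "P7",
    "Aguirre_EGSFR1938": "P8",
    "Aguirre_EGSFR1982": "P9",
    "Aguirre_EGSFR2218": "P10",
}

# Reverse lookup: patient ID -> sample ID.
PATIENT_TO_SAMPLE = {patient: sample for sample, patient in SAMPLE_TO_PATIENT.items()}


def to_patient_label(sample_id):
    """Translate a sample ID into the patient ID used in the paper."""
    try:
        return SAMPLE_TO_PATIENT[sample_id]
    except KeyError:
        raise KeyError(f"Unknown sample ID: {sample_id!r}") from None
```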
Dataset | Link to paper | Link to download | Remarks |
---|---|---|---|
Discovery dataset, sn 10X multiome | Yates et al., ??? | Download | |
Discovery dataset, ST 10X Visium | Yates et al., ??? | Download | |
Single-cell, Carroll et al. | Carroll et al., 2023 | Download | Need to request access to data through EGA / contact author ([email protected]) |
Bulk, Carroll et al., RNA | Carroll et al., 2023 | Download | Need to request access to data through EGA / contact author ([email protected]) |
Bulk, Carroll et al., Clinical | Carroll et al., 2023 | Download | Inoperable cohort info is located here |
Single-cell, Croft et al. | Croft et al., 2022 | Download | Need to request single-cell annotations from author ([email protected]) |
Bulk, Hoefnagel et al., RNA | Hoefnagel et al., 2022 | Download | |
Bulk, Hoefnagel et al., Clinical | Hoefnagel et al., 2022 | NA | Need to request from the author ([email protected]) |
Bulk, TCGA, RNA (FPKM) | The Cancer Genome Atlas Research Network, 2017 | Download | Used in the general TCGA analysis script, file named "TCGA-ESCA.htseq_fpkm-uq.tsv.gz" |
Bulk, TCGA, RNA (Raw counts) | The Cancer Genome Atlas Research Network, 2017 | Download | Used as a basis to deconvolve for BayesPrism |
Bulk, TCGA, Clinical #1 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the general clinical+phenotypical info, named "TCGA.ESCA.sampleMap_ESCA_clinicalMatrix" |
Bulk, TCGA, Clinical #2 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the clinical info provided in the original paper, need to save as "ESCA_Nature_clinicalinfo.csv" |
Bulk, TCGA, Clinical #3 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the survival information, file named "Survival_SupplementalTable_S1_20171025_xena_sp" |
Bulk, TCGA, Clinical #4 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the HRD information, file named "TCGA.HRD_withSampleID.txt" |
Bulk, TCGA, ABSOLUTE purity | The Cancer Genome Atlas Research Network, 2017 | Download | This is the ABSOLUTE-estimated purity used to assess the BayesPrism deconvolution, file named "TCGA_absolute_purity.txt" in the script |
Single-cell, Luo et al. | Luo et al., 2022 | Download | Need to download counts and metadata at the same time from this link |
Data | Needed for what script? | Description | Link to download |
---|---|---|---|
Gene Mapping | R/scripts/BayesPrism/runBPrism.R | Gene probe map from the UCSC Xena browser that maps ENCODE to official gene ID | Download |
GENCODE annotations | python/notebooks/preprocessing-snRNA/XXXX.ipynb (where XXXX is any sample name) | Subset of GENCODE annotations v41 | Download or link to original GTF file |
Gene Programs from Gavish et al. | python/notebooks/analysis/5. cNMFCancerCells-perPatient.ipynb | Signature genes derived in the Gavish et al. paper | Download or link to original Excel file, the .csv corresponds to the first sheet only |
MSigDB Hallmarks of cancer GMT | python/notebooks/analysis/5. cNMFCancerCells-perPatient.ipynb | This is the file to run GSEA on the hallmarks of cancer | Download |
List of human transcription factors | python/notebooks/analysis/9. SCENICplus-analyze-cNMF.ipynb | This file contains all known human transcription factors as defined in the Lambert et al. paper | Download |
Cell cycle genes | python/notebooks/validation/3. Carroll-validation-set.ipynb | This file contains cell cycle genes used by Scanpy | Download |
Marker genes of Barrett's esophagus cell types | python/notebooks/validation/4. compare-Nowicki-BE.ipynb | Set of files, each containing marker genes of the Barrett's esophagus non-immune or stromal cell types | Download or Original paper tables; marker genes are derived from Suppl Table 7 |
Blacklisted regions of hg38 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | List of blacklisted regions to remove for analysis | Download |
Screen v10 region-based databases, SCENIC+, #1 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Ranking database of motifs | Download |
Screen v10 region-based databases, SCENIC+, #2 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Scores database of motifs | Download |
Motif v10 annotation, SCENIC+ | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Motif annotation | Download |
Screen v10 hg38 database, SCENIC, #1 | python/scripts/pyscenic/README.md | Ranking of motifs, big search space | Download |
Screen v10 hg38 database, SCENIC, #2 | python/scripts/pyscenic/README.md | Ranking of motifs, small search space | Download |
Annotation for local pycisTarget run | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | This is required if the HPC used to run pycisTarget does not have access to the internet | Download |
Annotation for local SCENIC+ search space run | python/scripts/scenicplus/2. run-scenicplus-script.py | This is required if the HPC used to run SCENIC+ does not have access to the internet | Download, this contains two files, 'annot_ensembl.csv' and 'chromsizes_ensembl.csv'. More info on why we need to do this can be found in this issue |
List of Lambert et al. TF names | python/scripts/scenicplus/2. run-scenicplus-script.py | List of all human TFs used for the search space | Download can be done using !wget -O utoronto_human_tfs_v_1.01.txt http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt , as recommended in the SCENIC+ tutorial. FYI, this is the same list as above, simply formatted for the SCENIC+ run |
Omnipath database of intercellular communication | python/notebooks/spatial-transcriptomics/2. SpatialData_analysis.ipynb | List of ligand-receptor interactions aggregated with omnipath, used to run LIGREC, a Squidpy implementation of CellPhoneDB | The user should run omnipath.interactions.import_intercell_network with default parameters and save the resulting csv. NOTE: we use the presaved .csv because the HPC used doesn't have access to the internet; otherwise this is equivalent to running Squidpy's LIGREC with default parameters, as the same database is automatically downloaded |
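
The last two rows of the table describe files to fetch on a machine with internet access before moving to the HPC. The sketch below shows one way to do this from Python: `download_file` is a standard-library stand-in for the `wget` command above, and `save_intercell_network` calls `omnipath.interactions.import_intercell_network` with default parameters, as described. The function names, the default output file name `omnipath_intercell_network.csv`, and the choice to write the CSV without an index column are assumptions; check the notebook for the file name and layout it actually expects.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the wget command in the table above.
TF_NAMES_URL = "http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt"


def download_file(url, dest):
    """Download `url` to `dest`, creating parent folders if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest


def save_intercell_network(dest="omnipath_intercell_network.csv"):
    """Fetch the omnipath intercell network (default parameters) and save it as CSV.

    Requires the third-party `omnipath` package and internet access.
    The output file name and index=False are assumptions, not the repository's settings.
    """
    import omnipath  # imported lazily so the download helper works without it

    interactions = omnipath.interactions.import_intercell_network()
    dest = Path(dest)
    interactions.to_csv(dest, index=False)
    return dest
```

For example, `download_file(TF_NAMES_URL, "utoronto_human_tfs_v_1.01.txt")` reproduces the `wget` call, and `save_intercell_network()` writes the interaction table to the current directory.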