This repository holds all the scripts I used to characterise the FP-specific (fusion-positive) transcriptome and translatome, and to run the downstream analyses. The goal was to find genes driven by the fusion protein formed by joining either PAX3 or PAX7 with FOXO1. I used a combination of in-house RNA-seq data from the Maxima, St Jude pediatric hospital and tumor organoid models generated by the Drost lab.
All tools were run on our Utrecht HPC using Singularity containers. The following Docker images were used. Images were pulled from Docker Hub unless another registry is specified.
Tool name | Version | Docker link |
---|---|---|
MultiQC | 1.11 | nanozoo/multiqc:1.11--9dfdee6 |
Cutadapt | 3.4 | quay.io/biocontainers/cutadapt:3.4--py37h73a75cf_1 |
FastQC | 0.11.9 | staphb/fastqc:0.11.9 |
TrimGalore | 0.6.6 | vanheeschlab/trimgalore:0.6.6 |
STAR | 2.7.8a | mateongenaert/star:2.7.8a |
samtools | 1.12 | staphb/samtools:1.12 |
stringtie | 2.1.5 | bschiffthaler/stringtie:2.1.5 |
gffcompare | 0.12.6 | quay.io/biocontainers/gffcompare:0.12.6--h4ac6f70_2 |
gffread | 0.12.6 | bschiffthaler/gffread:0.12.6 |
salmon | 1.8.0 | combinelab/salmon:1.8.0 |
howarewestrandedhere | 1.0.1 | vanheeschlab/howarewestrandedhere:1.0.1 |
bowtie2 | 2.4.2 | quay.io/biocontainers/bowtie2:2.4.2--py37he8e2a3f_2 |
ORFquant | R v4.1.2 | vanheeschlab/orfquant:4.1.2a |
bedtools | 2.31.0 | pegi3s/bedtools:latest |
MACS2 | 2.2.7.1 | fooliu/macs2:version-2.2.7.1 |
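
As an illustration of how one of the images above can be run through Singularity, here is a minimal Python sketch that builds a `singularity exec` command from a Docker image reference. The helper name `singularity_exec_command`, the bind path and the BAM path are hypothetical and not taken from this repository; the actual scripts may call the containers differently.

```python
import shlex


def singularity_exec_command(image, tool_args, binds=()):
    """Build a `singularity exec` command that runs a tool from a Docker image.

    image     : Docker image reference as listed in the table above
    tool_args : the tool command and its arguments, as a list of strings
    binds     : host paths to make visible inside the container
    """
    cmd = ["singularity", "exec"]
    for path in binds:
        cmd += ["--bind", path]
    # The docker:// prefix tells Singularity to fetch the image from a Docker registry.
    cmd.append(f"docker://{image}")
    cmd += list(tool_args)
    return cmd


cmd = singularity_exec_command(
    "staphb/samtools:1.12",
    ["samtools", "flagstat", "/data/sample.bam"],  # hypothetical input file
    binds=["/data"],  # hypothetical bind path
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Singularity is installed.
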
When I started really diving into the project, I created rms_analysis as a catch-all for most of the files I generated, to keep them in one tidy location. I have documented which files are pulled from where for the next person who takes over. In addition, I have provided some information on the code sub-directories.
The first step was to find new transcripts and genes present in the tumor RNA-seq. After assembling the transcriptome, we quantified the RMS transcriptome and compared its expression against various cohorts, including GTEx, EVO-DEVO from the Kaessmann lab and other in-house tumor cohorts. We established RMS-specific genes using the thresholds and filters established in the quantification part.
- annotation
  - The previous annotation that was generated for the project was not compatible with the containerised R version. These scripts generate a new custom annotation package for this analysis.
- Correlations
  - Compute gene-gene correlations to see which genes are similarly expressed.
- Figures
  - Markdown documents that generate various figures (heatmaps, dot plots, volcano plots).
- QC
  - Scripts that visualise various QC parameters.
- quant_all_cohorts
  - Combine salmon quant files into a single R object for downstream analysis.
- quantification
  - Code to run salmon quant on the samples.
- rnaseq_pipeline
  - Pipeline used to analyse the RNA-seq samples from a previous folder.
- starfusion
  - Small script to run a containerised version of STAR-Fusion to check the fusion status of the samples.
- transcriptome_characterisation
  - Small script to visualise certain aspects of the transcriptome.
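
The combining step in quant_all_cohorts is done in R in this repository. Purely to illustrate the idea, the Python sketch below merges the TPM column of several salmon `quant.sf` tables into one transcript-by-sample table. The function names, sample names and values are made up for the example, and filling missing transcripts with 0.0 is an assumption; samples quantified against the same index should normally share all transcripts.

```python
import csv
import io


def read_tpm(handle):
    """Read one salmon quant.sf table and return {transcript name: TPM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def merge_tpm(per_sample):
    """Merge {sample: {transcript: TPM}} into (sample order, {transcript: [TPM per sample]})."""
    samples = sorted(per_sample)
    names = sorted(set().union(*(per_sample[s] for s in samples)))
    # Assumption: a transcript missing from a sample is reported as 0.0.
    matrix = {name: [per_sample[s].get(name, 0.0) for s in samples] for name in names}
    return samples, matrix


# Two tiny in-memory stand-ins for quant.sf files (values are invented).
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
quant_a = io.StringIO(HEADER + "tx1\t1000\t850.0\t12.5\t40\n" + "tx2\t500\t350.0\t0.0\t0\n")
quant_b = io.StringIO(HEADER + "tx1\t1000\t850.0\t3.0\t10\n" + "tx3\t700\t550.0\t8.0\t21\n")

samples, matrix = merge_tpm({"sample_A": read_tpm(quant_a), "sample_B": read_tpm(quant_b)})
print(samples)
for name, values in matrix.items():
    print(name, values)
```
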
With the transcriptome generated in the previous section, I looked for new open reading frames (ORFs) in ribo-seq data from both patient tissue and tumor organoid models, using both PRICE and ORFquant. The ORF calls were harmonised and the expression of each ORF in every sample was quantified using the in-frame P-sites. Using specific filters, I eventually created a list of FP-RMS translated (non-canonical) ORFs for further investigation.
- orfquant_merged
  - Sub-step of the pipeline to merge all P-sites for ORF calling using ORFquant.
- riboseq_pipeline
  - The pipeline used to process the ribo-seq samples for ORF calling.
- QC
  - Scripts that visualise various QC parameters.
- price_pipeline
  - Sub-step of the pipeline that adds a STAR alignment step so that the output of our standard ribo-seq pipeline can be used as input for PRICE.
- orf_annotation
  - Code to re-annotate ORFs based on sequence and coordinate overlap with Ensembl / UniProt annotated protein sequences.
- orf_selection
  - Markdown which specifies under which parameters I have selected ORFs for follow-up studies.
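
To make the idea of quantifying an ORF by its in-frame P-sites concrete, here is a minimal Python sketch. It is not the code used in this repository: it assumes a single-exon ORF on the plus strand with 0-based coordinates, and the function name and example numbers are invented. Spliced or minus-strand ORFs would first need their positions converted to ORF-relative coordinates.

```python
def count_in_frame_psites(orf_start, orf_end, psite_counts):
    """Sum P-site counts that fall inside an ORF and in its reading frame.

    orf_start    : 0-based position of the first nucleotide of the start codon
    orf_end      : 0-based position just past the last nucleotide of the ORF
    psite_counts : {genomic position: number of P-sites at that position}

    Returns (in_frame, total), where total counts every P-site inside the ORF.
    """
    in_frame = 0
    total = 0
    for pos, n in psite_counts.items():
        if orf_start <= pos < orf_end:
            total += n
            # Frame 0 positions are the first nucleotide of each codon.
            if (pos - orf_start) % 3 == 0:
                in_frame += n
    return in_frame, total


# Invented example: a 12-nt ORF spanning positions 100..111.
psites = {97: 3, 100: 5, 101: 1, 103: 4, 105: 2, 112: 7}
in_frame, total = count_in_frame_psites(100, 112, psites)
print(in_frame, total, in_frame / total)
```

Reporting the in-frame count alongside the total makes it easy to also filter on the in-frame fraction, which is one way to judge periodicity.
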
This is a subset of Amalia's pipeline to check for interesting characteristics of the newly found ncORFs. So far, I have only looked at the netMHCpan MHC-binding predictions for the ncORFs and used the linked peptides to select potential candidates for further investigation for immunotherapy.
The idea was to look at the translational efficiency (TE) of the tumor organoid models, contrasting FN and FP samples. However, due to time constraints, this never took off. The plan was to convert the CDS regions of the translated ORFs to sequences and use those as input for salmon on both the RNA-seq and ribosome profiling data of the tumor organoid samples.
- integrated_omics
  - Some code I've written where multiple data streams were connected.
- rnaseq_TE
  - Pipeline to align RNA-seq reads the exact same way as ribo-seq reads. Did not fulfill the requirements we needed for direct RNA-seq and ribo-seq comparisons.
- salmon_te_quant
  - Actual TE processing pipeline. It only requires the processed ORFs as the loci to quantify in both the RNA-seq and ribosome profiling data.
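
Since the TE analysis was never finished, the following is only a sketch of the calculation that was intended: translational efficiency as the ratio of ribosome profiling abundance to RNA-seq abundance for the same ORF. The function name, the pseudocount of 1.0 and the example values are assumptions for illustration, not choices made in this repository.

```python
import math


def translational_efficiency(ribo_tpm, rna_tpm, pseudocount=1.0):
    """Return {ORF: log2 TE} for ORFs quantified in both data types.

    ribo_tpm, rna_tpm : {ORF id: TPM} from the ribo-seq and RNA-seq quantification
    pseudocount       : added to both values to avoid division by zero (assumed value)
    """
    shared = ribo_tpm.keys() & rna_tpm.keys()
    return {
        orf: math.log2((ribo_tpm[orf] + pseudocount) / (rna_tpm[orf] + pseudocount))
        for orf in sorted(shared)
    }


# Invented example values for one sample.
ribo = {"orf_a": 15.0, "orf_b": 3.0, "orf_c": 1.0}
rna = {"orf_a": 7.0, "orf_b": 3.0}

te = translational_efficiency(ribo, rna)
for orf, value in te.items():
    print(orf, value)
```

ORFs present in only one of the two inputs (here `orf_c`) are dropped. Contrasting FN and FP samples would then compare these per-ORF values between the two groups.
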
Using a particularly information-rich dataset for RMS tissue and cell lines, I was able to generate IGV visualisation tracks using the code in this section.
We were interested in the HLA types of the tumor organoid models for downstream wet-lab validation of the predicted MHC binders in package 03. This pipeline uses arcasHLA to genotype the HLA loci.