This repository contains the code used for the analysis and production of the results described in the paper "Machine learning based CRISPR gRNA design for therapeutic exon skipping".
The analysis was performed using Python 3.7.5 and Jupyter. The dependencies are listed in environment.yml. We recommend using the conda package manager from Anaconda Python to create an environment for running the analysis:
conda env create -f environment.yml
Activate the environment by:
conda activate skipguide_data_processing
The provided Jupyter notebooks (see Usage section) can produce all the results starting from the raw sequencing data. However, the computations can take hours or days, depending on computational resources. The notebooks are configured to skip certain long computations if pre-computed files are available, so we recommend downloading the pre-computed files before running the notebooks.
If you opt not to use the pre-computed files, the raw sequencing data must be available. Download the data from here (raw/archive.tar.bz2), extract the archive, and place the *.fastq files in the data/reads directory before running the provided notebooks. Alternatively, the same *.fastq files are available on NCBI SRA under BioProject accession PRJNA647416. Running all the notebooks may then take hours or days, depending on computational resources.
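The extraction step above can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded locally; the function name place_fastq_files is hypothetical and not part of this repository, and because the internal layout of the archive is not documented here, the sketch searches it recursively for *.fastq files.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def place_fastq_files(archive_path, reads_dir="data/reads"):
    """Extract a tar archive and move every *.fastq file into reads_dir.

    Hypothetical helper for illustration; only use it on archives you trust.
    """
    reads_dir = Path(reads_dir)
    reads_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory() as tmp:
        # Mode "r:*" lets tarfile detect the compression (bz2, xz, gz) itself.
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(tmp)
        # Assumption: the *.fastq files may sit at any depth inside the archive.
        for fastq in sorted(Path(tmp).rglob("*.fastq")):
            target = reads_dir / fastq.name
            shutil.move(str(fastq), str(target))
            moved.append(target)
    return moved
```

For example, place_fastq_files("archive.tar.bz2") would populate data/reads relative to the current working directory, so it should be called from the repository root.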
If you opt to use the pre-computed files, the raw sequencing data is not necessary. Download the pre-computed files from here (precomputed/cache.tar.xz), extract the archive, and replace the cache directory with the extracted cache directory. Running all the notebooks should then take less than half an hour.
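Replacing the cache directory can be sketched in Python as well. This is an illustration under stated assumptions, not the authors' procedure: the function name replace_cache is hypothetical, and the sketch assumes the archive contains exactly one directory named cache. Note that it deletes the existing cache directory before moving the extracted one into place.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def replace_cache(archive_path, cache_dir="cache"):
    """Replace cache_dir with the 'cache' directory found inside the archive.

    Hypothetical helper for illustration; only use it on archives you trust.
    """
    cache_dir = Path(cache_dir)
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(tmp)
        # Assumption: the archive holds exactly one directory named "cache".
        candidates = [p for p in Path(tmp).rglob("cache") if p.is_dir()]
        if len(candidates) != 1:
            raise ValueError(
                f"expected one 'cache' directory, found {len(candidates)}"
            )
        if cache_dir.exists():
            shutil.rmtree(cache_dir)  # destructive: removes the old cache
        cache_dir.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(candidates[0]), str(cache_dir))
    return cache_dir
```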
You can open the provided Jupyter notebooks under src and view the outputs. This section details how you can run the notebooks from scratch.
See the Data Files section to obtain the necessary data files.
If pre-computed files are not used, modify the NUM_PROCESSES variable in config.py to specify the number of cores for multiprocessing.
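One way to pick a value for NUM_PROCESSES is to derive it from the number of cores the machine reports. The sketch below is only a suggestion: the helper suggest_num_processes and its reserve parameter are hypothetical, and how config.py actually consumes NUM_PROCESSES is not shown here.

```python
import os


def suggest_num_processes(reserve=1, total=None):
    """Suggest a worker count, leaving `reserve` cores free for other work.

    Hypothetical helper; `total` defaults to the core count reported by the OS.
    """
    if total is None:
        # os.cpu_count() can return None when the count is undetermined.
        total = os.cpu_count() or 1
    # Never go below one worker, even on single-core machines.
    return max(1, total - reserve)


# Print a value that could be copied into config.py as NUM_PROCESSES.
print("NUM_PROCESSES =", suggest_num_processes())
```

Leaving one core free keeps the Jupyter server and the rest of the system responsive during long runs; set reserve=0 to use every reported core.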
Start a Jupyter notebook server, e.g.:
jupyter notebook --port=8888
Run the provided notebooks under src in the following order:
- Sequence_Extraction.ipynb
- Barcode_Sequence_Lookup_Tables.ipynb
- datA_Characterize_Sequences_Indels.ipynb
- inDelphi_Evaluation.ipynb
- inDelphi_Check_Data_Leakage.ipynb
- datB_Characterize_Skipping.ipynb
- SpliceAI_Predict_Skipping.ipynb
- MMSplice_Predict_Skipping.ipynb
- MetaSplice_SkipGuide_Evaluation.ipynb
Inspect the comments and markdown in the notebooks for more context.
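As an alternative to running each notebook interactively, the order listed above can be scripted with jupyter nbconvert. This is a minimal sketch, assuming nbconvert is installed in the active environment and the script is started from the repository root; the function names build_commands and run_all are hypothetical. By default it only prints the commands (a dry run).

```python
import subprocess
from pathlib import Path

# Notebook order as listed in this README.
NOTEBOOKS = [
    "Sequence_Extraction.ipynb",
    "Barcode_Sequence_Lookup_Tables.ipynb",
    "datA_Characterize_Sequences_Indels.ipynb",
    "inDelphi_Evaluation.ipynb",
    "inDelphi_Check_Data_Leakage.ipynb",
    "datB_Characterize_Skipping.ipynb",
    "SpliceAI_Predict_Skipping.ipynb",
    "MMSplice_Predict_Skipping.ipynb",
    "MetaSplice_SkipGuide_Evaluation.ipynb",
]


def build_commands(src_dir="src"):
    """Return one nbconvert command per notebook, in execution order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute", "--inplace",
            # Disable the per-cell timeout, since some cells may run for hours.
            "--ExecutePreprocessor.timeout=-1",
            str(Path(src_dir) / name),
        ]
        for name in NOTEBOOKS
    ]


def run_all(src_dir="src", dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in build_commands(src_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing notebook, because later
            # notebooks are listed after the ones whose outputs they may need.
            subprocess.run(cmd, check=True)
```

Calling run_all(dry_run=False) executes the notebooks in place, overwriting their stored outputs, so keep a copy if you want to compare against the outputs shipped with the repository.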