diff --git a/2024.3/.DS_Store b/2024.3/.DS_Store index a86ce31..f70b9b2 100644 Binary files a/2024.3/.DS_Store and b/2024.3/.DS_Store differ diff --git a/2024.3/assets/images/download_file.gif b/2024.3/assets/images/download_file.gif new file mode 100644 index 0000000..0d7198c Binary files /dev/null and b/2024.3/assets/images/download_file.gif differ diff --git a/2024.3/assets/images/igv_ELOVL5.png b/2024.3/assets/images/igv_ELOVL5.png new file mode 100644 index 0000000..56b74fc Binary files /dev/null and b/2024.3/assets/images/igv_ELOVL5.png differ diff --git a/2024.3/assets/images/new_file.gif b/2024.3/assets/images/new_file.gif new file mode 100644 index 0000000..11e2dc6 Binary files /dev/null and b/2024.3/assets/images/new_file.gif differ diff --git a/2024.3/assets/images/open_terminal.gif b/2024.3/assets/images/open_terminal.gif new file mode 100644 index 0000000..c0280f7 Binary files /dev/null and b/2024.3/assets/images/open_terminal.gif differ diff --git a/2024.3/assets/images/vscode_login_page.png b/2024.3/assets/images/vscode_login_page.png new file mode 100644 index 0000000..4de7eb3 Binary files /dev/null and b/2024.3/assets/images/vscode_login_page.png differ diff --git a/2024.3/course_material/group_work/group_work/index.html b/2024.3/course_material/group_work/group_work/index.html index ed9c20d..ba246d0 100644 --- a/2024.3/course_material/group_work/group_work/index.html +++ b/2024.3/course_material/group_work/group_work/index.html @@ -582,7 +582,7 @@
Each project has tasks and questions. By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. Consider the tasks and questions as guidance; if interesting questions pop up during the project, you are encouraged to work on those. You also don’t have to perform all the tasks and answer all the questions.
In the afternoon of day 1, you will divide the initial tasks and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation by each group.
Each group has access to a shared working directory. It is mounted in the root directory (/
).
Each group has access to a shared working directory. It is mounted in the root directory (/group_work/groupX
). You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), clicking File > Add folder to workspace, and typing the path to the group work directory.
In this project, you will be working with data from the same resource as the data we have already worked on:
-Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain. Molecular Psychiatry, 25(1), 37–47. https://doi.org/10.1038/s41380-019-0583-1.
-It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR.
+++Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric Lécuyer. “Profiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.” BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
+
It is Oxford Nanopore Technology sequencing data of cDNA from extracellular vesicles and whole cells. It is primarily used to discover new splice variants. We will use the dataset to do that and, in addition, perform a differential isoform expression analysis with FLAIR.
Project aim
Discover new splice variants and identify differentially expressed isoforms.
@@ -574,43 +576,25 @@Note
-Download the data file package in your shared working directory, i.e. : /group_work/<group name>
or ~/<group name>
. Only one group member has to do this.
Download the data file package in your shared working directory, i.e. : /group_work/<group name>
. Only one group member has to do this. You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), clicking File > Add folder to workspace, and typing the path to the group work directory.
This will create a directory project1
with the following structure:
project1/
-├── alignments
-│ ├── cerebellum-5238-batch2.bam
-│ ├── cerebellum-5298-batch2.bam
-│ ├── cerebellum-5346-batch2.bam
-│ ├── parietal_cortex-5238-batch1.bam
-│ ├── parietal_cortex-5298-batch1.bam
-│ └── parietal_cortex-5346-batch1.bam
-├── counts
-│ └── counts_matrix_test.tsv
├── reads
-│ ├── cerebellum-5238-batch2.fastq.gz
-│ ├── cerebellum-5298-batch2.fastq.gz
-│ ├── cerebellum-5346-batch2.fastq.gz
-│ ├── parietal_cortex-5238-batch1.fastq.gz
-│ ├── parietal_cortex-5298-batch1.fastq.gz
-│ ├── parietal_cortex-5346-batch1.fastq.gz
-│ ├── striatum-5238-batch2.fastq.gz
-│ ├── striatum-5298-batch2.fastq.gz
-│ └── striatum-5346-batch2.fastq.gz
+│ ├── Cell_1.fastq.gz
+│ ├── Cell_2.fastq.gz
+│ ├── Cell_3.fastq.gz
+│ ├── EV_1.fastq.gz
+│ ├── EV_2.fastq.gz
+│ └── EV_3.fastq.gz
├── reads_manifest.tsv
-└── scripts
- └── differential_expression_example.Rmd
+└── references
+ ├── Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf
+ └── Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa
-4 directories, 18 files
-
Download the fasta file and gtf like this:
-cd project1/
-mkdir reference
-cd reference
-wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
-wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz
-gunzip *.gz
+2 directories, 9 files
The reads folder contains the fastq files with reads, which are described in reads_manifest.tsv. EV means ‘extracellular vesicle’, Cell means ‘entire cells’. In the references folder you can find the reference sequence and annotation.
You can start this project by dividing the initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are:
We will be working with data from:
--Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain. Molecular Psychiatry, 25(1), 37–47. https://doi.org/10.1038/s41380-019-0583-1
+Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric Lécuyer. “Profiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.” BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
The authors used full-transcript amplicon sequencing with Oxford Nanopore Technology of CACNA1C, a gene associated with psychiatric risk.
-For the exercises of today, we will work with a single sample of this study. Download and unpack the data files in your home directory.
-cd ~/workdir
-wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz
-tar -xvf ngs-longreads-training.tar.gz
-rm ngs-longreads-training.tar.gz
+The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture.
+For the exercises of today, we will work with two samples of this study. Download and unpack the data files in your home directory.
+cd ~/project
+wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz
+tar -xvf project1.tar.gz
+rm project1.tar.gz
-Exercise: This will create the directory data
. Check out what’s in there.
+Exercise: This will create the directory called project1
. Check out what’s in there.
Answer
-Go to the ~/workdir/data
folder:
-cd ~/data
+Go to the ~/project/project1
folder:
+cd ~/project/project1
The data folder contains the following:
-
data/
+project1/
├── reads
-│ └── cerebellum-5238-batch2.fastq.gz
-└── reference
- └── Homo_sapiens.GRCh38.dna.chromosome.12.fa
-
-2 directories, 2 files
+│ ├── Cell_1.fastq.gz
+│ ├── Cell_2.fastq.gz
+│ ├── Cell_3.fastq.gz
+│ ├── EV_1.fastq.gz
+│ ├── EV_2.fastq.gz
+│ └── EV_3.fastq.gz
+├── reads_manifest.tsv
+└── references
+ ├── Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf
+ └── Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa
+
+2 directories, 9 files
-In the reads folder a fastq file with reads, in the reference folder the reference sequence.
+The reads folder contains the fastq files with reads, which are described in reads_manifest.tsv. EV means ‘extracellular vesicle’, Cell means ‘entire cells’. In the references folder you can find the reference sequence and annotation.
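A file like reads_manifest.tsv is typically a tab-separated table of the kind FLAIR’s quantify step consumes, with columns sample id, condition, batch, and path to the fastq. An illustrative sketch using this project’s file names (the actual file may differ):

```
EV_1	EV	batch1	reads/EV_1.fastq.gz
EV_2	EV	batch1	reads/EV_2.fastq.gz
Cell_1	Cell	batch1	reads/Cell_1.fastq.gz
```

You can inspect the real layout with `cat reads_manifest.tsv` once the data is downloaded.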
2. Quality control
-We will evaluate the read quality with NanoPlot
.
-Exercise: Check out the manual of NanoPlot
with the command NanoPlot --help
, and run NanoPlot
on ~/data/reads/cerebellum-5238-batch2.fastq.gz
.
+We will evaluate the read quality of two fastq files with NanoPlot
.
+Exercise: Check out the manual of NanoPlot
with the command NanoPlot --help
. After that run NanoPlot
on
+
+reads/Cell_2.fastq.gz
+reads/EV_2.fastq.gz
.
+
+Your fastq files are in the ‘rich’ format, meaning they have additional information regarding the ONT run.
Hint
-For a basic output of NanoPlot
on a fastq.gz
file you can use the options --outdir
and --fastq
.
+For a basic output of NanoPlot
on a fastq.gz
file you can use the options --outdir
and --fastq_rich
.
Answer
-We have a fastq
file, so based on the manual and the example we can run:
-cd ~/workdir
+We have a rich fastq
file, so based on the manual and the example we can run:
+cd ~/project/project1
+
+mkdir -p nanoplot
+
+NanoPlot \
+--fastq_rich reads/Cell_2.fastq.gz \
+--outdir nanoplot/Cell_2
+
NanoPlot \
---fastq data/reads/cerebellum-5238-batch2.fastq.gz \
---outdir nanoplot_output
+--fastq_rich reads/EV_2.fastq.gz \
+--outdir nanoplot/EV_2
-You will now have a directory with the following files:
-nanoplot_output
+In both output directories (nanoplot/Cell_2 and nanoplot/EV_2) you will now have the following files:
+.
+├── ActivePores_Over_Time.html
+├── ActivePores_Over_Time.png
+├── ActivityMap_ReadsPerChannel.html
+├── ActivityMap_ReadsPerChannel.png
+├── CumulativeYieldPlot_Gigabases.html
+├── CumulativeYieldPlot_Gigabases.png
+├── CumulativeYieldPlot_NumberOfReads.html
+├── CumulativeYieldPlot_NumberOfReads.png
├── LengthvsQualityScatterPlot_dot.html
├── LengthvsQualityScatterPlot_dot.png
├── LengthvsQualityScatterPlot_kde.html
├── LengthvsQualityScatterPlot_kde.png
+├── NanoPlot_20240221_1219.log
├── NanoPlot-report.html
-├── NanoPlot_20230309_1332.log
├── NanoStats.txt
├── Non_weightedHistogramReadlength.html
├── Non_weightedHistogramReadlength.png
├── Non_weightedLogTransformed_HistogramReadlength.html
├── Non_weightedLogTransformed_HistogramReadlength.png
+├── NumberOfReads_Over_Time.html
+├── NumberOfReads_Over_Time.png
+├── TimeLengthViolinPlot.html
+├── TimeLengthViolinPlot.png
+├── TimeQualityViolinPlot.html
+├── TimeQualityViolinPlot.png
├── WeightedHistogramReadlength.html
├── WeightedHistogramReadlength.png
├── WeightedLogTransformed_HistogramReadlength.html
├── WeightedLogTransformed_HistogramReadlength.png
├── Yield_By_Length.html
└── Yield_By_Length.png
+
+0 directories, 31 files
-The file NanoPlot-report.html
contains a report with all the information stored in the other files.
-Exercise: Download NanoPlot-report.html
to your local computer and answer the following questions:
-A. How many reads are in the file?
-B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected?
+The file NanoPlot-report.html
contains a report with all the information stored in the other files, and NanoStats.txt
in text format.
+Exercise: Check out some of the .png plots and the contents of NanoStats.txt
. Also, download NanoPlot-report.html
for both files to your local computer and answer the following questions:
+A. How many reads are in the files?
+B. What are the average read lengths? What does this tell us about the quality of both runs?
C. What is the average base quality and what kind of accuracy do we therefore expect?
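If you want to cross-check question A on the command line: a fastq record always spans four lines, so dividing the decompressed line count by four gives the read count. A minimal sketch on a toy file (for the real check, swap in reads/Cell_2.fastq.gz):

```shell
# build a toy fastq.gz with two reads (4 lines per record: header, sequence, '+', qualities)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > example.fastq.gz

# divide the decompressed line count by 4 to get the number of reads
echo $(( $(gzip -cd example.fastq.gz | wc -l) / 4 ))   # → 2
```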
Download files from the notebook
-You can download files from the file browser, by right-clicking a file and selecting Download:
+You can download files from the file browser, by right-clicking a file and selecting Download…:
Answer
-A. 3735
-B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected.
-C. The average base quality is 7.3. We have learned that \(p=10^{\frac{-baseQ}{10}}\), so the average probability that the base is wrong is \(10^{\frac{-7.3}{10}} = 0.186\). The expected accuracy is \(1-0.186=0.814\) or 81.4%.
+A. Cell_2: 49,808 reads; EV_2: 6,214 reads
+B. Cell_2: 1,186.7 bp; EV_2: 607.9 bp. Both runs are from cDNA, and transcripts are usually around 1-2 kb. The average read length is therefore quite short for EV_2.
+C. The median base quality is around 12 for both. This means that the error probability is about 10^(-12/10) ≈ 0.06, so an accuracy of about 94%.
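The arithmetic in answer C can be reproduced with a one-liner; the value q = 12 is the median base quality taken from the NanoPlot report above:

```shell
# Phred scale: p_error = 10^(-Q/10); accuracy = 1 - p_error
awk 'BEGIN { q = 12; p = 10^(-q/10); printf "p=%.3f accuracy=%.1f%%\n", p, (1 - p) * 100 }'
# prints: p=0.063 accuracy=93.7%
```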
3. Read alignment
The sequence aligner minimap2
is specifically developed for (splice-aware) alignment of long reads.
@@ -724,48 +759,26 @@ 3. Read alignment
We are working with ONT data so we could choose map-ont
. However, our data is also spliced. Therefore, we should choose splice
.
-Introns can be quite long in mammals; up to a few hundred kb.
-Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron.
-
-Hint
-First load hg38 in IGV, by clicking the topleft drop-down menu:
-
-After that type CACNA1C
in the search box:
-
-
-
-Answer
-The longest intron is about 350 kilo bases (350,000 base pairs)
-
-Exercise: Check out the -G
option of minimap2
. How does this relate to the the largest intron size of CACNA1C?
-
-Answer
-This is what the manual says:
--G NUM max intron length (effective with -xsplice; changing -r) [200k]
-
-We found an intron size of approximately 350k, so the default is set too small. We should be increase it to at least 350k.
-
Exercise: Make a directory called alignments
in your working directory. After that, modify the command below for minimap2
and run it from a script.
#!/usr/bin/env bash
-cd ~/workdir
+cd ~/project/project1
-minimap2 \
--a \
--x [PARAMETER] \
--G [PARAMETER] \
--t 4 \
-data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \
-data/reads/cerebellum-5238-batch2.fastq.gz \
-| samtools sort \
-| samtools view -bh > alignments/cerebellum-5238-batch2.bam
+mkdir -p alignments
-## indexing for IGV
-samtools index alignments/cerebellum-5238-batch2.bam
+for sample in EV_2 Cell_2; do
+ minimap2 \
+ -a \
+ -x [PARAMETER] \
+ -t 4 \
+ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \
+ reads/"$sample".fastq.gz \
+ | samtools sort \
+ | samtools view -bh > alignments/"$sample".bam
+
+ ## indexing for IGV
+ samtools index alignments/"$sample".bam
+done
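The `for sample in …; done` construct in the template above runs its body once per sample name. If the syntax is new, here is the pattern in isolation, echoing the file names the script would build instead of aligning:

```shell
# run the loop body once for each sample name
for sample in EV_2 Cell_2; do
  echo "aligning reads/${sample}.fastq.gz -> alignments/${sample}.bam"
done
# prints one line per sample
```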
Note
@@ -773,26 +786,26 @@ 3. Read alignment
Answer
-Make a directory like this:
-mkdir ~/workdir/alignments
-
-Modify the script to set the -x
and -G
options:
+Modify the script to set the -x
option:
#!/usr/bin/env bash
-cd ~/workdir
+cd ~/project/project1
-minimap2 \
--a \
--x splice \
--G 500k \
--t 4 \
-data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \
-data/reads/cerebellum-5238-batch2.fastq.gz \
-| samtools sort \
-| samtools view -bh > alignments/cerebellum-5238-batch2.bam
+mkdir -p alignments
-## indexing for IGV
-samtools index alignments/cerebellum-5238-batch2.bam
+for sample in EV_2 Cell_2; do
+ minimap2 \
+ -a \
+ -x splice \
+ -t 4 \
+ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \
+ reads/"$sample".fastq.gz \
+ | samtools sort \
+ | samtools view -bh > alignments/"$sample".bam
+
+ ## indexing for IGV
+ samtools index alignments/"$sample".bam
+done
And run it (e.g. if you named the script ont_alignment.sh
):
chmod u+x ont_alignment.sh
@@ -800,14 +813,30 @@ 3. Read alignment
4. Visualisation
-Let’s have a look at the alignments. Download the files cerebellum-5238-batch2.bam
and cerebellum-5238-batch2.bam.bai
to your local computer and load the .bam
file into IGV (File > Load from File…).
-Exercise: Have a look at the region chr12:2,632,655-2,635,447
by typing it into the search box. Do you see any evidence for alternative splicing already?
+Let’s have a look at the alignments. Download the files (in ~/project/project1/alignments
):
+
+EV_2.bam
+EV_2.bam.bai
+Cell_2.bam
+Cell_2.bam.bai
+
+to your local computer and load the .bam
files into IGV (File > Load from File…).
+Exercise: Have a look at the gene ELOVL5
by typing the name into the search box.
+
+- Do you see any evidence for alternative splicing already?
+- What is the difference in quality between the two samples? Would that have an effect on estimating differential splicing?
+
+
+Check out the paper
+The authors found splice variants. Check figure 5B in the paper.
+
Answer
-The two exons seem to be mutually exclusive:
+There is some observable exon skipping in Cell_2:
+The coverage of EV_2 is quite low. Also, a lot of the reads do not fully cover the gene. This will make it difficult to estimate differential splicing.
diff --git a/2024.3/course_material/server_login/index.html b/2024.3/course_material/server_login/index.html
index a15df5b..b0b9dfd 100644
--- a/2024.3/course_material/server_login/index.html
+++ b/2024.3/course_material/server_login/index.html
@@ -330,6 +330,26 @@
Learning outcomes
+
+
+ -
+
+ Exercises
+
+
+
+
-
@@ -340,7 +360,7 @@
-
-
+
Exercises
@@ -348,7 +368,7 @@
-
-
+
First login
@@ -400,13 +420,6 @@
-
-
- -
-
- Loops
-
-
@@ -575,6 +588,26 @@
Learning outcomes
+
+
+ -
+
+ Exercises
+
+
+
+
-
@@ -585,7 +618,7 @@
-
-
+
Exercises
@@ -593,7 +626,7 @@
-
-
+
First login
@@ -645,13 +678,6 @@
-
-
- -
-
- Loops
-
-
@@ -710,24 +736,26 @@ Learning outcomes
- Docker
-
+
-If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002
) in your browser. This should result in the following page:
-
-Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal:
+Exercises
+First login
+If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002
) in your browser. This should result in the following page:
-For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File:
+
+Info
+The link gives you access to a web version of Visual Studio Code. This is a powerful code editor that you can also use as a local application on your computer.
+
+Type in the password that was provided to you by the teacher. Now let’s open the terminal. You can do that with Ctrl+` (backtick), or by clicking Application menu > Terminal > New Terminal:
-Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py
extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. We will be using mainly shell scripting in this course, so here’s an example for adjusting it to shell syntax highlighting:
+For reasons such as efficiency and reproducibility, it makes sense to execute your commands from a script. You can create one with the ‘new file’ button:
@@ -736,8 +764,8 @@ Material
- Instructions to install docker
- Instructions to set up to container
-Exercises
-First login
+Exercises
+First login
Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system.
In the video below there’s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10.
@@ -745,89 +773,47 @@ First login
Modify the script
Modify the path after -v
to the working directory on your computer before running it.
+
docker run \
--rm \
--e JUPYTER_ENABLE_LAB=yes \
--v /path/to/workingdir/:/home/jovyan \
--p 8888:8888 \
-geertvangeest/ngs-longreads-jupyter:latest \
-start-notebook.sh
-
-
-If this command has run successfully, you will find a link and token in the console, e.g.:
-http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13
+-p 8443:8443 \
+-e PUID=1000 \
+-e PGID=1000 \
+-e DEFAULT_WORKSPACE=/config/project \
+-v $PWD:/config/project \
+geertvangeest/ngs-longreads-vscode:latest
-Copy this URL into your browser, and you will be able to use the jupyter notebook.
-The option -v
mounts a local directory in your computer to the directory /home/jovyan
in the docker container (‘jovyan’ is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
+If this command has run successfully, navigate in your browser to http://localhost:8443.
+The option -v
mounts a local directory on your computer to the directory /config/project
in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
Don’t mount directly in the home dir
Don’t directly mount your local directory to the home directory (/root
). This will lead to unexpected behaviour.
-The part geertvangeest/ngs-longreads-jupyter:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it’s on your computer, it will start immediately.
-
-
-If you have a conda installation on your local computer, you can install the required software using conda.
-You can build the environment from ngs-longreads.yml
-Generate the conda environment like this:
-conda env create --name ngs-longreads -f ngs-longreads.yml
-
-
-The yaml
file probably only works for Linux systems
-If you want to use the conda environment on a different OS, use:
-conda create -n ngs-longreads python=3.6
-
-conda activate ngs-longreads
-
-conda install -y -c bioconda \
-samtools \
-minimap2 \
-fastqc \
-pbmm2 \
-
-conda install -y -c bioconda nanoplot
-
-If the installation of NanoPlot
fails, try to install it with pip
:
-pip install NanoPlot
-
-
-This will create the conda environment ngs-longreads
-Activate it like so:
-conda activate ngs-longreads
-
-After successful installation and activating the environment all the software required to do the exercises should be available.
+The part geertvangeest/ngs-longreads-vscode:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it’s on your computer, it will start immediately.
A UNIX command line interface (CLI) refresher
-Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory.
+Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory.
+If you need some reminders of the commands, here’s a link to a UNIX command line cheat sheet:
+
Make a new directory
-Login to the server and use the command line to make a directory called workdir
.
-
-If working with Docker
-If your are working with docker you are a root user. This means that your “home” directory is the root directory, i.e. /root
, and not /home/username
. If you have mounted your local directory to /root/workdir
, this directory should already exist.
-
+Make a directory scripts
within ~/project
and make it your current directory.
Answer
-cd
-mkdir workdir
-
-
-Make a directory scripts
within ~/workdir
and make it your current directory.
-
-Answer
-cd workdir
+cd ~/project
mkdir scripts
cd scripts
File permissions
-Generate an empty script in your newly made directory ~/workdir/scripts
like this:
+Generate an empty script in your newly made directory ~/project/scripts
like this:
touch new_script.sh
Add a command to this script that writes “SIB courses are great!” (or something you can better relate to.. ) to stdout, and try to run it.
Answer
-The script should look like this:
+Generate a script as described above. The script should look like this:
#!/usr/bin/env bash
echo "SIB courses are great!"
@@ -868,10 +854,10 @@ File permissions
More on chmod
and file permissions here.
Redirection: >
and |
-In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory (~/workdir
; use ls
and >
).
+In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory.
Answer
-ls / > ~/workdir/system_dirs.txt
+ls / > ~/project/system_dirs.txt
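A related detail worth remembering: `>` truncates the target file, while `>>` appends to it. A quick illustration in a scratch file:

```shell
echo first  > demo.txt    # '>' creates or overwrites demo.txt
echo second >> demo.txt   # '>>' appends a line
cat demo.txt
# prints:
# first
# second
```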
The command wc -l
counts the number of lines, and can read from stdin. Make a one-liner with a pipe |
symbol to find out how many system directories and files there are.
@@ -884,8 +870,7 @@ Variables
Store system_dirs.txt
as variable (like this: VAR=variable
), and use wc -l
on that variable to count the number of lines in the file.
Answer
-cd ~/workdir
-FILE=system_dirs.txt
+FILE=~/project/system_dirs.txt
wc -l $FILE
@@ -899,79 +884,6 @@ shell scripts
ls | wc -l
-Loops
-
- 20 minutes
-
-If you want to run the same command on a range of arguments, it’s not very convenient to type the command for each individual argument. For example, you could write dog
, fox
, bird
to stdout in a script like this:
-#!/usr/bin/env bash
-
-echo dog
-echo fox
-echo bird
-
-However, if you want to change the command (add an option for example), you would have to change it for all the three command calls. Amongst others for that reason, you want to write the command only once. You can do this with a for-loop, like this:
-#!/usr/bin/env bash
-
-ANIMALS="dog fox bird"
-
-for animal in $ANIMALS
-do
- echo $animal
-done
-
-Which results in:
-dog
-fox
-bird
-
-Write a shell script that removes all the letters “e” from a list of words.
-
-Hint
-Removing the letter “e” from a string can be done with tr
like this:
-
word="test"
-echo $word | tr -d "e"
-
-Which would result in:
-tst
-
-
-
-Answer
-Your script should e.g. look like this (I’ve added some awesome functionality):
-#!/usr/bin/env bash
-
-WORDLIST="here is a list of words resulting in a sentence"
-
-for word in $WORDLIST
-do
- echo "'$word' with e's removed looks like:"
- echo $word | tr -d "e"
-done
-
-resulting in:
-'here' with e's removed looks like:
-hr
-'is' with e's removed looks like:
-is
-'a' with e's removed looks like:
-a
-'list' with e's removed looks like:
-list
-'of' with e's removed looks like:
-of
-'words' with e's removed looks like:
-words
-'resulting' with e's removed looks like:
-rsulting
-'in' with e's removed looks like:
-in
-'a' with e's removed looks like:
-a
-'sentence' with e's removed looks like:
-sntnc
-
-
diff --git a/2024.3/search/search_index.json b/2024.3/search/search_index.json
index f43bd92..5ff4de9 100644
--- a/2024.3/search/search_index.json
+++ b/2024.3/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Learning outcomes General learning outcomes After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. 
Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Home"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline","title":"General learning 
outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. 
If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + UNIX refresher 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Applications - PacBio 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Quality control and Read alignment Day 2 block start end subject block 1 9:15 AM 10:00 AM Talk + Q&A with Alban Ramette (ONT) 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Talk + Q&A with Pamela Nicholson (PacBio) block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:15 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + UNIX refresher 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Applications - PacBio 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Quality control and Read alignment","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:00 AM Talk + Q&A with Alban Ramette (ONT) 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Talk + Q&A with Pamela Nicholson (PacBio) block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:15 PM Presentations","title":"Day 2"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect 
participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with the UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Compute Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be accessed through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with the UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Compute Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be accessed through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"course_material/applications/","text":"Learning outcomes After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. 
Describe the basic steps that are required to perform a genome assembly Material Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Applications"},{"location":"course_material/applications/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. Describe the basic steps that are required to perform a genome assembly","title":"Learning outcomes"},{"location":"course_material/applications/#material","text":"Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Material"},{"location":"course_material/introduction/","text":"Learning outcomes After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to the development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for different situations Material The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Introduction"},{"location":"course_material/introduction/#learning-outcomes","text":"After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to the development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for 
different situations","title":"Learning outcomes"},{"location":"course_material/introduction/#material","text":"The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Material"},{"location":"course_material/qc_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer Material Download the presentation Exercises 1. Retrieve data We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used Oxford Nanopore Technology full-transcript amplicon sequencing of CACNA1C, a gene associated with psychiatric risk. For today\u2019s exercises, we will work with a single sample from this study. Download and unpack the data files in your working directory. cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. 
Answer Go to the ~/workdir/data folder: cd ~/workdir/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files The reads folder contains a fastq file with reads; the reference folder contains the reference sequence. 2. Quality control We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/workdir/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. 
Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%. 3. Read alignment The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the help with minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical argument to the option -x for our dataset? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. 
Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the top-left drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilobases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh 4. Visualisation Let\u2019s have a look at the alignments. 
Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"QC and alignment"},{"location":"course_material/qc_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer","title":"Learning outcomes"},{"location":"course_material/qc_alignment/#material","text":"Download the presentation","title":"Material"},{"location":"course_material/qc_alignment/#exercises","text":"","title":"Exercises"},{"location":"course_material/qc_alignment/#1-retrieve-data","text":"We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used Oxford Nanopore Technology full-transcript amplicon sequencing of CACNA1C, a gene associated with psychiatric risk. For today\u2019s exercises, we will work with a single sample from this study. Download and unpack the data files in your working directory. 
cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. Answer Go to the ~/workdir/data folder: cd ~/workdir/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files The reads folder contains a fastq file with reads; the reference folder contains the reference sequence.","title":"1. Retrieve data"},{"location":"course_material/qc_alignment/#2-quality-control","text":"We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/workdir/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . 
Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. 
The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%.","title":"2. Quality control"},{"location":"course_material/qc_alignment/#3-read-alignment","text":"The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the help with minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical argument to the option -x for our dataset? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the top-left drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilobases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. 
After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh","title":"3. Read alignment"},{"location":"course_material/qc_alignment/#4-visualisation","text":"Let\u2019s have a look at the alignments. Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"4. Visualisation"},{"location":"course_material/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or set up your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to log in to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) in your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For efficiency and reproducibility, among other reasons, it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting: Material Instructions to install docker Instructions to set up the container Exercises First login Docker can be used to run an entire isolated environment in a container. 
This means that you can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB=yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888:8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory on your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. 
If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml . Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems. If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads . Activate it like so: conda activate ngs-longreads After successful installation and activation of the environment, all the software required to do the exercises should be available. A UNIX command line interface (CLI) refresher Most bioinformatics software is UNIX-based and executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory. Make a new directory Log in to the server and use the command line to make a directory called workdir . If working with Docker If you are working with Docker you are a root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. 
) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l Variables Store system_dirs.txt as a variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l Loops 20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. 
For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all three command calls. For that reason, among others, you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g. look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Server login"},{"location":"course_material/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or set up your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to log in to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) into your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For efficiency and reproducibility, among other reasons, it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button at the bottom of the page.
We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting:","title":"Learning outcomes"},{"location":"course_material/server_login/#material","text":"Instructions to install docker Instructions to set up the container","title":"Material"},{"location":"course_material/server_login/#exercises","text":"","title":"Exercises"},{"location":"course_material/server_login/#first-login","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB=yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888:8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory on your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
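The mount logic above can be wrapped in a small script so the host path only has to be set once. A sketch, where the directory path is an assumption (pick any directory you own) and the image name is the course's:

```shell
#!/usr/bin/env bash
# assumption: $HOME/longreads-workdir is an example host working directory
WORKDIR="$HOME/longreads-workdir"
mkdir -p "$WORKDIR"

start_container() {
    docker run --rm \
        -e JUPYTER_ENABLE_LAB=yes \
        -v "$WORKDIR":/home/jovyan \
        -p 8888:8888 \
        geertvangeest/ngs-longreads-jupyter:latest \
        start-notebook.sh
}
# call start_container yourself once WORKDIR points where you want it
```

Quoting `"$WORKDIR"` keeps the command working even when the chosen path contains spaces.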
Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml . Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads . Activate it like so: conda activate ngs-longreads After successfully installing and activating the environment, all the software required to do the exercises should be available.","title":"First login"},{"location":"course_material/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software is UNIX based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory.","title":"A UNIX command line interface (CLI) refresher"},{"location":"course_material/server_login/#make-a-new-directory","text":"Log in to the server and use the command line to make a directory called workdir.
If working with Docker If you are working with Docker, you are the root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts","title":"Make a new directory"},{"location":"course_material/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you relate to more) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"course_material/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there is a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin.
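To see the difference between handing wc -l a file argument and feeding it via stdin, here is a small stand-alone demo (the file name is an example):

```shell
# create a three-line demo file
printf 'a\nb\nc\n' > /tmp/wc_demo.txt

wc -l /tmp/wc_demo.txt        # file argument: prints the count and the file name
wc -l < /tmp/wc_demo.txt      # stdin via redirection: prints the count only
printf 'a\nb\nc\n' | wc -l    # stdin via a pipe: prints the count only
```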
Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"course_material/server_login/#variables","text":"Store system_dirs.txt as a variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"course_material/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"course_material/server_login/#loops","text":"20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all three command calls. Among other reasons, that is why you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g.
look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Loops"},{"location":"course_material/group_work/group_work/","text":"Learning outcomes After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline Introduction The last part of this course will consist of project-based learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger than in the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of hard disk space. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, even if you are working individually. Roles & organisation Project-based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher.
There will be differences in levels among participants, but because of that, some will learn efficiently from people who have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group. Working directories Each group has access to a shared working directory. It is mounted in the root directory ( / ).","title":"Introduction"},{"location":"course_material/group_work/group_work/#learning-outcomes","text":"After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline","title":"Learning outcomes"},{"location":"course_material/group_work/group_work/#introduction","text":"The last part of this course will consist of project-based learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger than in the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of hard disk space.
If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, even if you are working individually.","title":"Introduction"},{"location":"course_material/group_work/group_work/#roles-organisation","text":"Project-based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people who have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"course_material/group_work/project1/","text":"Project 1: Differential isoform expression analysis of ONT data In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347.
https://doi.org/10.1038/s41380-019-0583-1 . It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget 
ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz Before you start You can start this project by dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis. Tasks & questions Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments.
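The merge-and-correct steps just described might look like this as a script (a sketch, wrapped in a function; output file names are assumptions, and the flag names follow the samtools and FLAIR documentation — double-check them against your installed flair version):

```shell
#!/usr/bin/env bash
# sketch only: run inside the 'flair' conda environment
flair_correct_steps() {
    samtools merge -f merged.bam alignments/*.bam      # combine per-sample BAMs
    samtools index merged.bam
    bam2Bed12 -i merged.bam > merged.bed12             # FLAIR helper script
    flair correct \
        -q merged.bed12 \
        -g reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \
        -f reference/Homo_sapiens.GRCh38.102.gtf
}
# only run where flair is actually installed
if command -v flair > /dev/null 2>&1; then
    flair_correct_steps
fi
```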
Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to log in to the RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate the productivity of the different isoforms.","title":"Project 1"},{"location":"course_material/group_work/project1/#project-1-differential-isoform-expression-analysis-of-ont-data","text":"In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347.
You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz","title":" Project 1: Differential isoform expression analysis of ONT data"},{"location":"course_material/group_work/project1/#before-you-start","text":"You can start this 
project by dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis.","title":"Before you start"},{"location":"course_material/group_work/project1/#tasks-questions","text":"Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run.
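The collapse step above might look like this (a sketch; the corrected-reads file name depends on the prefix flair correct used, and all paths are assumptions — check the FLAIR documentation for your installed version):

```shell
#!/usr/bin/env bash
# sketch only: run inside the 'flair' conda environment
flair_collapse_step() {
    flair collapse \
        -g reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \
        --gtf reference/Homo_sapiens.GRCh38.102.gtf \
        -q flair_all_corrected.bed \
        -r reads/*.fastq.gz \
        -t 4
}
# only run where flair is actually installed
if command -v flair > /dev/null 2>&1; then
    flair_collapse_step
fi
```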
Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to log in to the RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate the productivity of the different isoforms.","title":"Tasks & questions"},{"location":"course_material/group_work/project2/","text":"Project 2: Repeat expansion analysis of PacBio data You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using PacBio\u2019s No-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project2.tar.gz tar -xvf project2.tar.gz rm project2.tar.gz Note Download the data file package in your shared working directory, i.e. /group_work/ . Only one group member has to do this.
It has the following directory structure: project2 \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u251c\u2500\u2500 reference \u2502 \u251c\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \u2502 \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa.fai \u2514\u2500\u2500 targets \u2514\u2500\u2500 targets.bed 3 directories, 11 files The targets in gene1 and gene2 are described in targets/targets.bed . The columns in these .bed files give the chromosome, start, and end, and describe the motifs. To reduce computational load, the reference contains only chromosomes 4 and X of the hg38 human reference genome. Tasks & questions Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (CCS). Is this quality expected? How is the read length? Align the reads to reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Check out some of the bam files in IGV. How does that look? Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Use trgt to genotype the repeats. Basically, you want to know the expansion size of each repeat in each sample.
Based on this, you can figure out which sample has abnormal expansions in which repeat. To run trgt , read the manual . After the alignment, all required input files should be there. To visualize the output, use samtools to sort and index the bam file with the reads spanning the repeats (this is also explained in the manual - no need to run bcftools ). Run trvz to visualize the output. The allele plot should suffice. The visualization will give you a nice overview of the repeat expansions in the samples. Based on the different sizes of the repeat expansions, can you relate the repeat expansions to the disease phenotype? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Project 2"},{"location":"course_material/group_work/project2/#project-2-repeat-expansion-analysis-of-pacbio-data","text":"You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using PacBio\u2019s No-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project2.tar.gz tar -xvf project2.tar.gz rm project2.tar.gz Note Download the data file package in your shared working directory, i.e. /group_work/ . Only one group member has to do this.
It has the following directory structure: project2 \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u251c\u2500\u2500 reference \u2502 \u251c\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \u2502 \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa.fai \u2514\u2500\u2500 targets \u2514\u2500\u2500 targets.bed 3 directories, 11 files The targets in gene1 and gene2 are described in targets/targets.bed . The columns in these .bed files give the chromosome, start, and end, and describe the motifs. To reduce computational load, the reference contains only chromosomes 4 and X of the hg38 human reference genome.","title":" Project 2: Repeat expansion analysis of PacBio data"},{"location":"course_material/group_work/project2/#tasks-questions","text":"Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (CCS). Is this quality expected? How is the read length? Align the reads to reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Check out some of the bam files in IGV. How does that look? Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Use trgt to genotype the repeats.
Basically, you want to know the expansion size of each repeat in each sample. Based on this, you can figure out which sample has abnormal expansions in which repeat. To run trgt read the manual . After the alignment, all required input files should be there. To visualize the output, use samtools to sort and index the bam file with the reads spanning the repeats (this is also explained in the manual - no need to run bcftools ). Run trvz to visualize the output. The allele plot should suffice. The visualization will give you a nice overview of the repeat expansions in the samples. Based on the different sizes of the repeat expansions, can you relate the repeat expansions to the disease phenotype? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Tasks & questions"},{"location":"course_material/group_work/project3/","text":"Project 3: Assembly and annotation of bacterial genomes You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files Before you start You can start this project by dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka Tasks and questions Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the help first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . 
You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at, for example, the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Project 3"},{"location":"course_material/group_work/project3/#project-3-assembly-and-annotation-of-bacterial-genomes","text":"You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files","title":" Project 3: Assembly and annotation of bacterial genomes"},{"location":"course_material/group_work/project3/#before-you-start","text":"You can start this project by dividing the species over the different group members. 
In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka","title":"Before you start"},{"location":"course_material/group_work/project3/#tasks-and-questions","text":"Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the help first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at, for example, the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . 
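Put together, the four steps above can be sketched as one per-sample loop. This is a minimal sketch under stated assumptions: the flye mode --pacbio-hifi is a guess about the read type (verify it against flye --help and your QC results), and the output directory names are illustrative.

```shell
# hedged per-sample sketch: QC, assembly, completeness, annotation
conda activate assembly
for i in 1 2 3 4 5 6 7 8; do
  # read QC
  NanoPlot --fastq sample_${i}.fastq.gz --outdir nanoplot/sample_${i} --threads 4
  # assembly; --pacbio-hifi is an assumption, pick the mode matching your data
  flye --pacbio-hifi sample_${i}.fastq.gz --out-dir flye/sample_${i} --threads 4
  # completeness, with automated prokaryote lineage selection
  busco -i flye/sample_${i}/assembly.fasta -o busco_sample_${i} -m genome --auto-lineage-prok -c 4
  # annotation
  prokka --outdir prokka/sample_${i} --prefix sample_${i} --cpus 4 flye/sample_${i}/assembly.fasta
done
```

In practice each group member runs the loop body for their own sample number(s) rather than looping over all eight.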
Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Tasks and questions"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Learning outcomes General learning outcomes After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. 
Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Home"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline","title":"General learning 
outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. 
If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Applications - PacBio 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Quality control and Read alignment Day 2 block start end subject block 1 9:15 AM 10:00 AM Talk + Q&A with Alban Ramette (ONT) 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Talk + Q&A with Pamela Nicholson (PacBio) block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:15 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Applications - PacBio 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Quality control and Read alignment","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:00 AM Talk + Q&A with Alban Ramette (ONT) 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Talk + Q&A with Pamela Nicholson (PacBio) block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:15 PM Presentations","title":"Day 2"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect 
participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Compute Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Compute Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"course_material/applications/","text":"Learning outcomes After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. 
Describe the basic steps that are required to perform a genome assembly Material Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Applications"},{"location":"course_material/applications/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. Describe the basic steps that are required to perform a genome assembly","title":"Learning outcomes"},{"location":"course_material/applications/#material","text":"Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Material"},{"location":"course_material/introduction/","text":"Learning outcomes After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for different situations Material The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Introduction"},{"location":"course_material/introduction/#learning-outcomes","text":"After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for 
different situations","title":"Learning outcomes"},{"location":"course_material/introduction/#material","text":"The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Material"},{"location":"course_material/qc_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer Material Download the presentation Exercises 1. Retrieve data We will be working with data from: Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6. The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For the exercises of today, we will work with two samples of this study. Download and unpack the data files in your home directory. cd ~/project wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Exercise: This will create the directory called project1 . Check out what\u2019s in there. 
Answer Go to the ~/project/project1 folder: cd ~/project/project1 The data folder contains the following: project1/ \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 Cell_1.fastq.gz \u2502 \u251c\u2500\u2500 Cell_2.fastq.gz \u2502 \u251c\u2500\u2500 Cell_3.fastq.gz \u2502 \u251c\u2500\u2500 EV_1.fastq.gz \u2502 \u251c\u2500\u2500 EV_2.fastq.gz \u2502 \u2514\u2500\u2500 EV_3.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 references \u251c\u2500\u2500 Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa 2 directories, 9 files In the reads folder are the fastq files with reads, which are described in reads_manifest.tsv . EV means \u2018extracellular vesicle\u2019, Cell means \u2018entire cells\u2019. In the references folder you can find the reference sequence and annotation. 2. Quality control We will evaluate the read quality of two fastq files with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help . After that run NanoPlot on reads/Cell_2.fastq.gz reads/EV_2.fastq.gz . Your fastq files are in the \u2018rich\u2019 format, meaning they have additional information regarding the ONT run. Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq_rich . Answer We have a rich fastq file, so based on the manual and the example we can run: cd ~/project/project1 mkdir -p nanoplot NanoPlot \\ --fastq_rich reads/Cell_2.fastq.gz \\ --outdir nanoplot/Cell_2 NanoPlot \\ --fastq_rich reads/EV_2.fastq.gz \\ --outdir nanoplot/EV_2 In both directories you will now have a directory with the following files: . 
\u251c\u2500\u2500 ActivePores_Over_Time.html \u251c\u2500\u2500 ActivePores_Over_Time.png \u251c\u2500\u2500 ActivityMap_ReadsPerChannel.html \u251c\u2500\u2500 ActivityMap_ReadsPerChannel.png \u251c\u2500\u2500 CumulativeYieldPlot_Gigabases.html \u251c\u2500\u2500 CumulativeYieldPlot_Gigabases.png \u251c\u2500\u2500 CumulativeYieldPlot_NumberOfReads.html \u251c\u2500\u2500 CumulativeYieldPlot_NumberOfReads.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot_20240221_1219.log \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 NumberOfReads_Over_Time.html \u251c\u2500\u2500 NumberOfReads_Over_Time.png \u251c\u2500\u2500 TimeLengthViolinPlot.html \u251c\u2500\u2500 TimeLengthViolinPlot.png \u251c\u2500\u2500 TimeQualityViolinPlot.html \u251c\u2500\u2500 TimeQualityViolinPlot.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png 0 directories, 31 files The file NanoPlot-report.html contains a report with all the information stored in the other files, and NanoStats.txt in text format. Exercise: Check out some of the .png plots and the contents of NanoStats.txt . Also, download NanoPlot-report.html for both files to your local computer and answer the following questions: A. How many reads are in the files? B. 
What are the average read lengths? What does this tell us about the quality of both runs? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download\u2026 : Answer A. Cell_2: 49,808 reads; EV_2: 6,214 reads B. Cell_2: 1186.7; EV_2: 607.9. Both runs are from cDNA. Transcripts are usually around 1-2kb. The average read length is therefore quite short for EV_2. C. The median base quality is around 12 for both. This means that the error probability is about 10^(-12/10) = 0.06, so an accuracy of 94%. 3. Read alignment The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the help with minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical argument to the option -x for our dataset? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Exercise: Make a directory called alignments in your working directory. After that, modify the command below for minimap2 and run it from a script. 
#!/usr/bin/env bash cd ~/project/project1 mkdir -p alignments for sample in EV_2 Cell_2 ; do minimap2 \\ -a \\ -x [ PARAMETER ] \\ -t 4 \\ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \\ reads/ \" $sample \" .fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/ \" $sample \" .bam ## indexing for IGV samtools index alignments/ \" $sample \" .bam done Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Modify the script to set the -x option: #!/usr/bin/env bash cd ~/project/project1 mkdir -p alignments for sample in EV_2 Cell_2 ; do minimap2 \\ -a \\ -x splice \\ -t 4 \\ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \\ reads/ \" $sample \" .fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/ \" $sample \" .bam ## indexing for IGV samtools index alignments/ \" $sample \" .bam done And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh 4. Visualisation Let\u2019s have a look at the alignments. Download the files (in ~/project/project1/alignments ): EV_2.bam EV_2.bam.bai Cell_2.bam Cell_2.bam.bai to your local computer and load the .bam files into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the gene ELOVL5 by typing the name into the search box. Do you see any evidence for alternative splicing already? How is the difference in quality between the two samples? Would that have an effect on estimating differential splicing? Check out the paper The authors found splice variants. Check figure 5B in the paper . Answer There is some observable exon skipping in Cell_2: The coverage of EV_2 is quite low. Also, a lot of the reads do not fully cover the gene. 
This will make it difficult to estimate differential splicing.","title":"QC and alignment"},{"location":"course_material/qc_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer","title":"Learning outcomes"},{"location":"course_material/qc_alignment/#material","text":"Download the presentation","title":"Material"},{"location":"course_material/qc_alignment/#exercises","text":"","title":"Exercises"},{"location":"course_material/qc_alignment/#1-retrieve-data","text":"We will be working with data from: Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6. The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For the exercises of today, we will work with two samples of this study. Download and unpack the data files in your home directory. cd ~/project wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Exercise: This will create the directory called project1 . Check out what\u2019s in there. 
Answer Go to the ~/project/project1 folder: cd ~/project/project1 The data folder contains the following: project1/ \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 Cell_1.fastq.gz \u2502 \u251c\u2500\u2500 Cell_2.fastq.gz \u2502 \u251c\u2500\u2500 Cell_3.fastq.gz \u2502 \u251c\u2500\u2500 EV_1.fastq.gz \u2502 \u251c\u2500\u2500 EV_2.fastq.gz \u2502 \u2514\u2500\u2500 EV_3.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 references \u251c\u2500\u2500 Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa 2 directories, 9 files In the reads folder are the fastq files with reads, which are described in reads_manifest.tsv . EV means \u2018extracellular vesicle\u2019, Cell means \u2018entire cells\u2019. In the references folder you can find the reference sequence and annotation.","title":"1. Retrieve data"},{"location":"course_material/qc_alignment/#2-quality-control","text":"We will evaluate the read quality of two fastq files with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help . After that run NanoPlot on reads/Cell_2.fastq.gz reads/EV_2.fastq.gz . Your fastq files are in the \u2018rich\u2019 format, meaning they have additional information regarding the ONT run. Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq_rich . Answer We have a rich fastq file, so based on the manual and the example we can run: cd ~/project/project1 mkdir -p nanoplot NanoPlot \\ --fastq_rich reads/Cell_2.fastq.gz \\ --outdir nanoplot/Cell_2 NanoPlot \\ --fastq_rich reads/EV_2.fastq.gz \\ --outdir nanoplot/EV_2 In both directories you will now have a directory with the following files: . 
\u251c\u2500\u2500 ActivePores_Over_Time.html \u251c\u2500\u2500 ActivePores_Over_Time.png \u251c\u2500\u2500 ActivityMap_ReadsPerChannel.html \u251c\u2500\u2500 ActivityMap_ReadsPerChannel.png \u251c\u2500\u2500 CumulativeYieldPlot_Gigabases.html \u251c\u2500\u2500 CumulativeYieldPlot_Gigabases.png \u251c\u2500\u2500 CumulativeYieldPlot_NumberOfReads.html \u251c\u2500\u2500 CumulativeYieldPlot_NumberOfReads.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot_20240221_1219.log \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 NumberOfReads_Over_Time.html \u251c\u2500\u2500 NumberOfReads_Over_Time.png \u251c\u2500\u2500 TimeLengthViolinPlot.html \u251c\u2500\u2500 TimeLengthViolinPlot.png \u251c\u2500\u2500 TimeQualityViolinPlot.html \u251c\u2500\u2500 TimeQualityViolinPlot.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png 0 directories, 31 files The file NanoPlot-report.html contains a report with all the information stored in the other files, and NanoStats.txt in text format. Exercise: Check out some of the .png plots and the contents of NanoStats.txt . Also, download NanoPlot-report.html for both files to your local computer and answer the following questions: A. How many reads are in the files? B. 
What are the average read lengths? What does this tell us about the quality of both runs? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download\u2026 : Answer A. Cell_2: 49,808 reads; EV_2: 6,214 reads B. Cell_2: 1186.7; EV_2: 607.9. Both runs are from cDNA. Transcripts are usually around 1-2 kb, so the average read length is quite low for EV_2. C. The median base quality is around 12 for both. This means that the error probability is about 10^(-12/10) = 0.06, so an accuracy of 94%.","title":"2. Quality control"},{"location":"course_material/qc_alignment/#3-read-alignment","text":"The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the help with minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical argument to the option -x for our dataset? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Exercise: Make a directory called alignments in your working directory. After that, modify the command below for minimap2 and run it from a script. 
#!/usr/bin/env bash cd ~/project/project1 mkdir -p alignments for sample in EV_2 Cell_2 ; do minimap2 \\ -a \\ -x [ PARAMETER ] \\ -t 4 \\ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \\ reads/ \" $sample \" .fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/ \" $sample \" .bam ## indexing for IGV samtools index alignments/ \" $sample \" .bam done Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Modify the script to set the -x option: #!/usr/bin/env bash cd ~/project/project1 mkdir -p alignments for sample in EV_2 Cell_2 ; do minimap2 \\ -a \\ -x splice \\ -t 4 \\ references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \\ reads/ \" $sample \" .fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/ \" $sample \" .bam ## indexing for IGV samtools index alignments/ \" $sample \" .bam done And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh","title":"3. Read alignment"},{"location":"course_material/qc_alignment/#4-visualisation","text":"Let\u2019s have a look at the alignments. Download the files (in ~/project/project1/alignments ): EV_2.bam EV_2.bam.bai Cell_2.bam Cell_2.bam.bai to your local computer and load the .bam files into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the gene ELOVL5 by typing the name into the search box. Do you see any evidence for alternative splicing already? What is the difference in quality between the two samples? Would that have an effect on estimating differential splicing? Check out the paper The authors found splice variants. Check figure 5B in the paper . Answer There is some observable exon skipping in Cell_2: The coverage of EV_2 is quite low. Also, a lot of the reads do not fully cover the gene. This will make it difficult to estimate differential splicing.","title":"4. 
Visualisation"},{"location":"course_material/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or set up your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to log in to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker Exercises First login If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+` or by clicking Application menu > Terminal > New Terminal : For efficiency and reproducibility, among other things, it makes sense to execute your commands from a script. Create one with the \u2018new file\u2019 button: Material Instructions to install docker Instructions to set up the container Exercises First login Docker can be used to run an entire isolated environment in a container. 
This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-longreads-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory on your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. A UNIX command line interface (CLI) refresher Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore worthwhile to improve your knowledge of UNIX. 
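One construct worth refreshing here is the bash for loop, since the alignment exercises later loop over sample names. A minimal sketch (the echoed paths are illustrative, not files that exist yet):

```shell
#!/usr/bin/env bash
# Loop over the two sample names used in the alignment exercises and
# print the fastq path each one would correspond to.
for sample in EV_2 Cell_2; do
    echo "reads/${sample}.fastq.gz"
done
```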
For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet Make a new directory Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you relate to more) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should at least change the permissions for the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. 
Answer ls / | wc -l Variables Store the path to system_dirs.txt in a variable (like this: VAR=value ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"Server login"},{"location":"course_material/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or set up your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to log in to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker","title":"Learning outcomes"},{"location":"course_material/server_login/#exercises","text":"","title":"Exercises"},{"location":"course_material/server_login/#first-login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. 
Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+` or by clicking Application menu > Terminal > New Terminal : For efficiency and reproducibility, among other things, it makes sense to execute your commands from a script. Create one with the \u2018new file\u2019 button:","title":"First login"},{"location":"course_material/server_login/#material","text":"Instructions to install docker Instructions to set up the container","title":"Material"},{"location":"course_material/server_login/#exercises_1","text":"","title":"Exercises"},{"location":"course_material/server_login/#first-login_1","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-longreads-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory on your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. 
Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately.","title":"First login"},{"location":"course_material/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore worthwhile to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet","title":"A UNIX command line interface (CLI) refresher"},{"location":"course_material/server_login/#make-a-new-directory","text":"Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts","title":"Make a new directory"},{"location":"course_material/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you relate to more) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. 
Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should at least change the permissions for the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"course_material/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"course_material/server_login/#variables","text":"Store the path to system_dirs.txt in a variable (like this: VAR=value ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"course_material/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"course_material/group_work/group_work/","text":"Learning outcomes After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline Introduction The last part of this course will consist of project-based learning. 
This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger than in the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16 GB of RAM and 10 GB of hard disk space. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, even if you are working individually. Roles & organisation Project-based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. 
You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), click File > Add folder to workspace and type the path to the group work directory.","title":"Introduction"},{"location":"course_material/group_work/group_work/#learning-outcomes","text":"After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline","title":"Learning outcomes"},{"location":"course_material/group_work/group_work/#introduction","text":"The last part of this course will consist of project-based learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger than in the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16 GB of RAM and 10 GB of hard disk space. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, even if you are working individually.","title":"Introduction"},{"location":"course_material/group_work/group_work/#roles-organisation","text":"Project-based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. 
At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"course_material/group_work/group_work/#working-directories","text":"Each group has access to a shared working directory. It is mounted in the root directory ( /group_work/groupX ). You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), click File > Add folder to workspace and type the path to the group work directory.","title":"Working directories"},{"location":"course_material/group_work/project1/","text":"Project 1: Differential isoform expression analysis of ONT data In this project, you will be working with data from the same resource as the data we have already worked on: Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6. It is Oxford Nanopore Technology sequencing data of cDNA from extracellular vesicles and whole cells. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. 
You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package to your shared working directory, i.e. : /group_work/ . Only one group member has to do this. You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), click File > Add folder to workspace and type the path to the group work directory. This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 Cell_1.fastq.gz \u2502 \u251c\u2500\u2500 Cell_2.fastq.gz \u2502 \u251c\u2500\u2500 Cell_3.fastq.gz \u2502 \u251c\u2500\u2500 EV_1.fastq.gz \u2502 \u251c\u2500\u2500 EV_2.fastq.gz \u2502 \u2514\u2500\u2500 EV_3.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 references \u251c\u2500\u2500 Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa 2 directories, 9 files The reads folder contains a fastq file for each sample; the samples are described in reads_manifest.tsv . EV means \u2018extracellular vesicle\u2019, Cell means \u2018entire cells\u2019. In the references folder you can find the reference sequence and annotation. Before you start You can start this project by dividing the initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis. Tasks & questions Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How does the read quality compare to the publication? Align each sample separately with minimap2 with default parameters. 
Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 . Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to log in to the RStudio server (the username is rstudio ). 
Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate the productivity of the different isoforms.","title":"Project 1"},{"location":"course_material/group_work/project1/#project-1-differential-isoform-expression-analysis-of-ont-data","text":"In this project, you will be working with data from the same resource as the data we have already worked on: Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6. It is Oxford Nanopore Technology sequencing data of cDNA from extracellular vesicles and whole cells. It is primarily used to discover new splice variants. We will use the dataset to do that and, in addition, do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package to your shared working directory, i.e. : /group_work/ . Only one group member has to do this. You can add the group work directory to the workspace in VScode by opening the menu on the top right (hamburger symbol), click File > Add folder to workspace and type the path to the group work directory. 
This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 Cell_1.fastq.gz \u2502 \u251c\u2500\u2500 Cell_2.fastq.gz \u2502 \u251c\u2500\u2500 Cell_3.fastq.gz \u2502 \u251c\u2500\u2500 EV_1.fastq.gz \u2502 \u251c\u2500\u2500 EV_2.fastq.gz \u2502 \u2514\u2500\u2500 EV_3.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 references \u251c\u2500\u2500 Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa 2 directories, 9 files In the reads folder a fastq file with reads, which are described in reads_manifest.csv . EV means \u2018extracellular vesicle\u2019, Cell means \u2018entire cells\u2019. In the references folder you can find the reference sequence and annotation.","title":" Project 1: Differential isoform expression analysis of ONT data"},{"location":"course_material/group_work/project1/#before-you-start","text":"You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis.","title":"Before you start"},{"location":"course_material/group_work/project1/#tasks-questions","text":"Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. 
You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as the reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 . Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to log in to the RStudio server (the username is rstudio ). 
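The FLAIR steps above (merge, bed12 conversion, correct, collapse, quantify) can be sketched as below. This is a hedged outline: flag names follow the FLAIR documentation but may differ between FLAIR versions (check e.g. `flair correct --help`), and the output prefixes are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of the FLAIR pipeline; prefixes and per-sample bam names are assumptions.
set -euo pipefail
conda activate flair

# Merge the per-sample alignments, index the merged bam, convert to bed12
samtools merge alignments/merged.bam alignments/*.sorted.bam
samtools index alignments/merged.bam
bam2Bed12 -i alignments/merged.bam > merged.bed12

# Correct splice junctions; adding the gtf improves the alignments
flair correct \
    -q merged.bed12 \
    -g references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \
    -f references/Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \
    -o flair

# Collapse corrected reads into isoforms (~1.5 h)
flair collapse \
    -g references/Homo_sapiens.GRCh38.dna.primary_assembly.chr5.chr6.chrX.fa \
    -f references/Homo_sapiens.GRCh38.111.chr5.chr6.chrX.gtf \
    -q flair_all_corrected.bed \
    -r reads/*.fastq.gz \
    -o flair.collapse

# Generate the count matrix (~45 min); run from the directory containing reads/
flair quantify \
    -r reads_manifest.tsv \
    -i flair.collapse.isoforms.fa \
    -o flair.quantify
```

Run the whole sketch from the `project1` directory so the relative paths in `reads_manifest.tsv` resolve correctly.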
Investigate the isoform usage with the flair script plot_isoform_usage.py . Investigate the productivity of the different isoforms.","title":"Tasks & questions"},{"location":"course_material/group_work/project2/","text":"Project 2: Repeat expansion analysis of PacBio data You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using PacBio\u2019s No-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project2.tar.gz tar -xvf project2.tar.gz rm project2.tar.gz Note Download the data file package into your shared working directory, i.e. /group_work/ . Only one group member has to do this. It has the following directory structure: project2 \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u251c\u2500\u2500 reference \u2502 \u251c\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \u2502 \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa.fai \u2514\u2500\u2500 targets \u2514\u2500\u2500 targets.bed 3 directories, 11 files The targets in gene1 and gene2 are described in targets/targets.bed . The columns in this .bed file describe the chromosome, start, end, and motif. 
To reduce computational load, the reference contains only chromosomes 4 and X of the hg38 human reference genome. Tasks & questions Load the bed file into IGV and navigate to the regions it annotates. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Check out some of the bam files in IGV. How do they look? Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Use trgt to genotype the repeats. Basically, you want to know the expansion size of each repeat in each sample. Based on this, you can figure out which sample has abnormal expansions in which repeat. To run trgt , read the manual . After the alignment, all required input files should be there. To visualize the output, use samtools to sort and index the bam file with the reads spanning the repeats (this is also explained in the manual - no need to run bcftools ). Run trvz to visualize the output. The allele plot should suffice. The visualization will give you a nice overview of the repeat expansions in the samples. Based on their different sizes, can you relate the repeat expansions to the disease phenotype? 
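The trgt/trvz workflow above can be sketched for a single individual as follows. This is a hedged sketch: flag names follow the TRGT manual but may differ between trgt versions (newer releases use `trgt genotype`/`trgt plot` subcommands), and the output prefix, alignment path, and repeat ID placeholder are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: genotype the repeats for individual 1015, then visualize one repeat.
set -euo pipefail

# Genotype the repeats defined in targets.bed (check `trgt --help` for your version)
trgt \
    --genome reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \
    --repeats targets/targets.bed \
    --reads alignments/1015.sorted.bam \
    --output-prefix trgt_1015

# Sort and index the bam with the reads spanning the repeats (needed by trvz)
samtools sort -o trgt_1015.spanning.sorted.bam trgt_1015.spanning.bam
samtools index trgt_1015.spanning.sorted.bam

# Allele plot for one repeat; REPEAT_ID is a placeholder for an ID from targets.bed
trvz \
    --genome reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \
    --repeats targets/targets.bed \
    --vcf trgt_1015.vcf.gz \
    --spanning-reads trgt_1015.spanning.sorted.bam \
    --repeat-id REPEAT_ID \
    --image 1015_REPEAT_ID.svg
```

Repeating this per individual and per repeat gives the allele plots needed to compare expansion sizes across the eight samples.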
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Project 2"},{"location":"course_material/group_work/project2/#project-2-repeat-expansion-analysis-of-pacbio-data","text":"You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using PacBio\u2019s No-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project2.tar.gz tar -xvf project2.tar.gz rm project2.tar.gz Note Download the data file package into your shared working directory, i.e. /group_work/ . Only one group member has to do this. It has the following directory structure: project2 \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u251c\u2500\u2500 reference \u2502 \u251c\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa \u2502 \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa.fai \u2514\u2500\u2500 targets \u2514\u2500\u2500 targets.bed 3 directories, 11 files The targets in gene1 and gene2 are described in targets/targets.bed . The columns in this .bed file describe the chromosome, start, end, and motif. 
To reduce computational load, the reference contains only chromosomes 4 and X of the hg38 human reference genome.","title":" Project 2: Repeat expansion analysis of PacBio data"},{"location":"course_material/group_work/project2/#tasks-questions","text":"Load the bed file into IGV and navigate to the regions it annotates. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to reference/Homo_sapiens.GRCh38.dna.primary_assembly.chrX.chr4.fa with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Check out some of the bam files in IGV. How do they look? Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Use trgt to genotype the repeats. Basically, you want to know the expansion size of each repeat in each sample. Based on this, you can figure out which sample has abnormal expansions in which repeat. To run trgt , read the manual . After the alignment, all required input files should be there. To visualize the output, use samtools to sort and index the bam file with the reads spanning the repeats (this is also explained in the manual - no need to run bcftools ). Run trvz to visualize the output. The allele plot should suffice. The visualization will give you a nice overview of the repeat expansions in the samples. Based on their different sizes, can you relate the repeat expansions to the disease phenotype? 
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Tasks & questions"},{"location":"course_material/group_work/project3/","text":"Project 3: Assembly and annotation of bacterial genomes You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome from PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package into your shared working directory, i.e. /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files Before you start You can start this project by dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka Tasks and questions Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . 
Have a look at the help first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at, for example, the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Project 3"},{"location":"course_material/group_work/project3/#project-3-assembly-and-annotation-of-bacterial-genomes","text":"You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome from PacBio reads. 
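The per-sample QC, assembly, completeness, and annotation steps above can be sketched as below. This is a hedged sketch: the output directory names are assumptions, and the `--pacbio-hifi` mode is only a guess at the `--pacbio-??` choice the task asks you to make — verify it against `flye --help` and the read quality you observe.

```shell
#!/usr/bin/env bash
# Sketch of the full pipeline for one species; repeat for each sample.
set -euo pipefail
conda activate assembly
sample=sample_1

# Quality control
NanoPlot -t 4 --fastq "${sample}".fastq.gz -o nanoplot_"${sample}"

# Assembly (mode is an assumption -- pick the correct --pacbio-?? yourself)
flye --pacbio-hifi "${sample}".fastq.gz --out-dir flye_"${sample}" --threads 4

# Completeness check with automated prokaryote lineage selection, then plot
busco -i flye_"${sample}"/assembly.fasta -m genome --auto-lineage-prok \
    -o busco_"${sample}" -c 4
generate_plot.py -wd busco_"${sample}"

# Annotation
prokka --outdir prokka_"${sample}" --cpus 4 flye_"${sample}"/assembly.fasta
```

With four cores available, running the steps sequentially per sample and dividing samples over group members keeps everyone's cores busy.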
There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package into your shared working directory, i.e. /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files","title":" Project 3: Assembly and annotation of bacterial genomes"},{"location":"course_material/group_work/project3/#before-you-start","text":"You can start this project by dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka","title":"Before you start"},{"location":"course_material/group_work/project3/#tasks-and-questions","text":"Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the help first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? 
Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at, for example, the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Tasks and questions"}]}
\ No newline at end of file
diff --git a/2024.3/sitemap.xml b/2024.3/sitemap.xml
index f4c1ca4..8439791 100644
--- a/2024.3/sitemap.xml
+++ b/2024.3/sitemap.xml
@@ -2,57 +2,57 @@
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
None
- 2024-02-19
+ 2024-02-21
daily
\ No newline at end of file
diff --git a/2024.3/sitemap.xml.gz b/2024.3/sitemap.xml.gz
index f5a4788..10e90ff 100644
Binary files a/2024.3/sitemap.xml.gz and b/2024.3/sitemap.xml.gz differ