From bd93580004e77fc1ed2f220cdaa3a5bf9ecfae10 Mon Sep 17 00:00:00 2001
From: Geert van Geest
Date: Mon, 15 Jan 2024 16:38:08 +0100
Subject: [PATCH] Deployed 5cd6e91 to 2023.3 with MkDocs 1.4.1 and mike 1.1.2

---
 2023.3/index.html               |  21 ---------------------
 2023.3/search/search_index.json |   2 +-
 2023.3/sitemap.xml.gz           | Bin 204 -> 204 bytes
 3 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/2023.3/index.html b/2023.3/index.html
index d60162e..f8df943 100644
--- a/2023.3/index.html
+++ b/2023.3/index.html
@@ -250,13 +250,6 @@ Authors - - -
  • - - Material - -
  • @@ -566,13 +559,6 @@ Authors -
  • - -
  • - - Material - -
  • @@ -725,13 +711,6 @@

    Authors

  • -

    Material

    -

    Learning outcomes

    General learning outcomes

    After this course, you will be able to:

diff --git a/2023.3/search/search_index.json b/2023.3/search/search_index.json
index cc49673..03040e5 100644
--- a/2023.3/search/search_index.json
+++ b/2023.3/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Material This website Zoom meeting (through mail) Google doc (through mail) Slack channel Learning outcomes General learning outcomes After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. 
Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. 
To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Home"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#material","text":"This website Zoom meeting (through mail) Google doc (through mail) Slack channel","title":"Material"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline","title":"General learning outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. 
The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. 
To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control and Read alignment 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:00 PM Applications & Group work Day 2 block start end subject block 1 9:15 AM 10:00 AM Applications - ONT 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 PM Applicationns - PacBio block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:00 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control and Read alignment 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:00 PM Applications & Group work","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:00 AM Applications - ONT 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 PM Applicationns - PacBio block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:00 PM Presentations","title":"Day 2"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . 
Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"course_material/applications/","text":"Learning outcomes After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. 
Describe the basic steps that are required to perform a genome assembly Material Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Applications"},{"location":"course_material/applications/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. Describe the basic steps that are required to perform a genome assembly","title":"Learning outcomes"},{"location":"course_material/applications/#material","text":"Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Material"},{"location":"course_material/introduction/","text":"Learning outcomes After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for different situations Material The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Introduction"},{"location":"course_material/introduction/#learning-outcomes","text":"After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for 
different situations","title":"Learning outcomes"},{"location":"course_material/introduction/#material","text":"The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Material"},{"location":"course_material/qc_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer Material Download the presentation Exercises 1. Retrieve data We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used full-transcript amplicon sequencing with Oxford Nanopore Technology of CACNA1C, a gene associated with psychiatric risk. For the exercises of today, we will work with a single sample of this study. Download and unpack the data files in your home directory. cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. 
Answer Go to the ~/workdir/data folder: cd ~/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files In the reads folder a fastq file with reads, in the reference folder the reference sequence. 2. Quality control We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. 
Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%. 3. Read alignment The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Checkout the helper minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical parameter for our dataset to the option -x ? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. 
Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the topleft drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilo bases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should be increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh 4. Visualisation Let\u2019s have a look at the alignments. 
Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"QC and alignment"},{"location":"course_material/qc_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer","title":"Learning outcomes"},{"location":"course_material/qc_alignment/#material","text":"Download the presentation","title":"Material"},{"location":"course_material/qc_alignment/#exercises","text":"","title":"Exercises"},{"location":"course_material/qc_alignment/#1-retrieve-data","text":"We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used full-transcript amplicon sequencing with Oxford Nanopore Technology of CACNA1C, a gene associated with psychiatric risk. For the exercises of today, we will work with a single sample of this study. Download and unpack the data files in your home directory. 
cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. Answer Go to the ~/workdir/data folder: cd ~/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files In the reads folder a fastq file with reads, in the reference folder the reference sequence.","title":"1. Retrieve data"},{"location":"course_material/qc_alignment/#2-quality-control","text":"We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . 
Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. 
The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%.","title":"2. Quality control"},{"location":"course_material/qc_alignment/#3-read-alignment","text":"The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Checkout the helper minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical parameter for our dataset to the option -x ? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the topleft drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilo bases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should be increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. 
After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh","title":"3. Read alignment"},{"location":"course_material/qc_alignment/#4-visualisation","text":"Let\u2019s have a look at the alignments. Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"4. Visualisation"},{"location":"course_material/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) in your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting: Material Instructions to install docker Instructions to set up to container Exercises First login Docker can be used to run an entire isolated environment in a container. 
This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB = yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888 :8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token = 4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory in your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. 
If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads Activate it like so: conda activate ngs-longreads After successful installation and activating the environment all the software required to do the exercises should be available. A UNIX command line interface (CLI) refresher Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory. Make a new directory Login to the server and use the command line to make a directory called workdir . If working with Docker If you are working with docker you are a root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. 
) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l Variables Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l Loops 20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. 
For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all the three command calls. Amongst others for that reason, you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g. look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Server login"},{"location":"course_material/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) in your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. 
We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting:","title":"Learning outcomes"},{"location":"course_material/server_login/#material","text":"Instructions to install docker Instructions to set up to container","title":"Material"},{"location":"course_material/server_login/#exercises","text":"","title":"Exercises"},{"location":"course_material/server_login/#first-login","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB=yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888:8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory on your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. 
Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads Activate it like so: conda activate ngs-longreads After successful installation and activating the environment all the software required to do the exercises should be available.","title":"First login"},{"location":"course_material/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory.","title":"A UNIX command line interface (CLI) refresher"},{"location":"course_material/server_login/#make-a-new-directory","text":"Login to the server and use the command line to make a directory called workdir . 
If working with Docker If you are working with docker you are a root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts","title":"Make a new directory"},{"location":"course_material/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"course_material/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. 
Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"course_material/server_login/#variables","text":"Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"course_material/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"course_material/server_login/#loops","text":"20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all the three command calls. Amongst others for that reason, you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g. 
look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Loops"},{"location":"course_material/group_work/group_work/","text":"Learning outcomes After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline Introduction The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of harddisk. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually. Roles & organisation Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. 
There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group. Working directories Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~ ln -s /group_work/ ./ Now you can find your group directory at ~/ . Use this as much as possible. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Introduction"},{"location":"course_material/group_work/group_work/#learning-outcomes","text":"After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline","title":"Learning outcomes"},{"location":"course_material/group_work/group_work/#introduction","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. 
If working independently If you are working independently, you probably can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of harddisk. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.","title":"Introduction"},{"location":"course_material/group_work/group_work/#roles-organisation","text":"Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"course_material/group_work/group_work/#working-directories","text":"Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~ ln -s /group_work/ ./ Now you can find your group directory at ~/ . 
Use this as much as possible. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Working directories"},{"location":"course_material/group_work/project1/","text":"Project 1: Differential isoform expression analysis of ONT data In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz Before you start You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis. Tasks & questions Perform QC with fastqc and with NanoPlot . 
Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. 
In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate productivity of the different isoforms.","title":"Project 1"},{"location":"course_material/group_work/project1/#project-1-differential-isoform-expression-analysis-of-ont-data","text":"In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz","title":" Project 1: Differential isoform expression analysis of ONT data"},{"location":"course_material/group_work/project1/#before-you-start","text":"You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. 
Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis.","title":"Before you start"},{"location":"course_material/group_work/project1/#tasks-questions","text":"Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. 
reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate productivity of the different isoforms.","title":"Tasks & questions"},{"location":"course_material/group_work/project2/","text":"Project 2: Repeat expansion analysis of PacBio data You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using Pacbio\u2019s no-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/groupwork_pacbio.tar.gz tar -xvf groupwork_pacbio.tar.gz rm groupwork_pacbio.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
It has the following directory structure: groupwork_pacbio \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 bc1020.aln.bam \u2502 \u251c\u2500\u2500 bc1021.aln.bam \u2502 \u2514\u2500\u2500 bc1022.aln.bam \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u2514\u2500\u2500 targets \u251c\u2500\u2500 target_gene1_hg38.bed \u2514\u2500\u2500 target_gene2_hg38.bed 3 directories, 13 files The targets in gene1 and gene2 are described in targets/target_gene1_hg38.bed and targets/target_gene2_hg38.bed respectively. The columns in these .bed files describe the chromosome, start, end, name, motifs, and whether the motifs are in reverse complement. You can download the reference genome like this: cd groupwork_pacbio mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Before you start You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Browse IGV to find the genes Perform the QC with NanoPlot Perform the alignment with minimap2 Do the repeat analysis with makeReports.sh Alignment files to do an initial repeat analysis are in the tar.gz package. However, it contains only the files for individuals with disease2. You can develop scripts and analyses based on that. To do the full analysis, all the alignments will need to be run. Tasks & questions Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? 
What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to hg38 with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Start the alignment as soon as possible The alignment takes quite some time. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x asm20 \\ -d reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x asm20 \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reads/ Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Clone the PacBio apps-scripts repository to the server. All dependencies are in the conda environment pacbio . Activate it with conda activate pacbio . The script apps-scripts/RepeatAnalysisTools/makeReports.sh generates repeat expansion reports. Check out the documentation , and generate repeat expansion reports for all individuals on both gene1 and gene2. Check out the report output and read the further documentation of RepeatAnalysisTools . How is the enrichment? Does the clustering make sense? How does the clustering look in IGV? Which individual is affected with which disease? Based on the size of the expansions, can you say something about expected disease severity? 
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Project 2"},{"location":"course_material/group_work/project2/#project-2-repeat-expansion-analysis-of-pacbio-data","text":"You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using Pacbio\u2019s no-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/groupwork_pacbio.tar.gz tar -xvf groupwork_pacbio.tar.gz rm groupwork_pacbio.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
It has the following directory structure: groupwork_pacbio \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 bc1020.aln.bam \u2502 \u251c\u2500\u2500 bc1021.aln.bam \u2502 \u2514\u2500\u2500 bc1022.aln.bam \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u2514\u2500\u2500 targets \u251c\u2500\u2500 target_gene1_hg38.bed \u2514\u2500\u2500 target_gene2_hg38.bed 3 directories, 13 files The targets in gene1 and gene2 are described in targets/target_gene1_hg38.bed and targets/target_gene2_hg38.bed respectively. The columns in these .bed files describe the chromosome, start, end, name, motifs, and whether the motifs are in reverse complement. You can download the reference genome like this: cd groupwork_pacbio mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz","title":" Project 2: Repeat expansion analysis of PacBio data"},{"location":"course_material/group_work/project2/#before-you-start","text":"You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Browse IGV to find the genes Perform the QC with NanoPlot Perform the alignment with minimap2 Do the repeat analysis with makeReports.sh Alignment files to do an initial repeat analysis are in the tar.gz package. However, it contains only the files for individuals with disease2. You can develop scripts and analyses based on that. 
To do the full analysis, all the alignments will need to be run.","title":"Before you start"},{"location":"course_material/group_work/project2/#tasks-questions","text":"Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to hg38 with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Start the alignment as soon as possible The alignment takes quite some time. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x asm20 \\ -d reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x asm20 \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reads/ Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Clone the PacBio apps-scripts repository to the server. All dependencies are in the conda environment pacbio . Activate it with conda activate pacbio . The script apps-scripts/RepeatAnalysisTools/makeReports.sh generates repeat expansion reports. Check out the documentation , and generate repeat expansion reports for all individuals on both gene1 and gene2. Check out the report output and read the further documentation of RepeatAnalysisTools . How is the enrichment? Does the clustering make sense? How does the clustering look in IGV? 
Which individual is affected with which disease? Based on the size of the expansions, can you say something about expected disease severity? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Tasks & questions"},{"location":"course_material/group_work/project3/","text":"Project 3: Assembly and annotation of bacterial genomes You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files Before you start You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka Tasks and questions Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . 
How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the helper first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at for example the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Project 3"},{"location":"course_material/group_work/project3/#project-3-assembly-and-annotation-of-bacterial-genomes","text":"You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. 
Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files","title":" Project 3: Assembly and annotation of bacterial genomes"},{"location":"course_material/group_work/project3/#before-you-start","text":"You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka","title":"Before you start"},{"location":"course_material/group_work/project3/#tasks-and-questions","text":"Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the helper first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? 
Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at for example the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? 
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Tasks and questions"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Learning outcomes General learning outcomes After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline Learning outcomes explained To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn. Learning experiences To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. 
If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. 
To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Home"},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Marco Kreuzer .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#learning-outcomes","text":"","title":"Learning outcomes"},{"location":"#general-learning-outcomes","text":"After this course, you will be able to: Describe the basics behind PacBio SMRT sequencing and Oxford Nanopore Technology sequencing Use the command line to perform quality control and read alignment of long-read sequencing data Develop and execute a bioinformatics pipeline to perform an alignment-based analysis Answer biological questions based on the analysis resulting from the pipeline","title":"General learning outcomes"},{"location":"#learning-outcomes-explained","text":"To reach the general learning outcomes above, we have set a number of smaller learning outcomes. Each chapter (found at Course material ) starts with these smaller learning outcomes. Use these at the start of a chapter to get an idea what you will learn. Use them also at the end of a chapter to evaluate whether you have learned what you were expected to learn.","title":"Learning outcomes explained"},{"location":"#learning-experiences","text":"To reach the learning outcomes we will use lectures, exercises, polls and group work. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. 
All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Find the buttons in the participants list (\u2018Participants\u2019 button): Alternatively, (depending on your zoom version or OS) use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. 
To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack","title":"Asking questions"},{"location":"course_schedule/","text":"Day 1 block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control and Read alignment 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:00 PM Applications & Group work Day 2 block start end subject block 1 9:15 AM 10:00 AM Applications - ONT 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Applications - PacBio block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:00 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject block 1 9:15 AM 10:30 AM Introduction 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Server login + unix fresh up 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control and Read alignment 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:00 PM Applications & Group work","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:00 AM Applications - ONT 10:00 AM 10:30 AM BREAK block 2 10:30 AM 11:30 AM Applications - PacBio block 3 11:30 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 4 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 5 3:30 PM 5:00 PM Presentations","title":"Day 2"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . 
Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"course_material/applications/","text":"Learning outcomes After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. 
Describe the basic steps that are required to perform a genome assembly Material Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Applications"},{"location":"course_material/applications/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain for what kind of questions long-read sequencing technologies are more suitable compared to short-read sequencing technologies. Describe the basic steps that are required to perform a genome assembly","title":"Learning outcomes"},{"location":"course_material/applications/#material","text":"Download the presentation More on adaptive sampling More on Cas9 targeted sequencing ONT long-read-tools.org","title":"Material"},{"location":"course_material/introduction/","text":"Learning outcomes After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for different situations Material The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Introduction"},{"location":"course_material/introduction/#learning-outcomes","text":"After having completed this chapter you will be able to: Illustrate the difference between short-read and long-read sequencing Explain which type of invention led to development of long-read sequencing Describe the basic techniques behind Oxford Nanopore sequencing and PacBio sequencing Choose based on the characteristics of the discussed sequencing platforms which one is most suited for 
different situations","title":"Learning outcomes"},{"location":"course_material/introduction/#material","text":"The introduction presentation: Download the presentation The sequencing technologies presentation: Download the presentation Nice review on long read sequencing in humans (also relevant for other species) Review on long read sequencing data analysis","title":"Material"},{"location":"course_material/qc_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer Material Download the presentation Exercises 1. Retrieve data We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used full-transcript amplicon sequencing with Oxford Nanopore Technology of CACNA1C, a gene associated with psychiatric risk. For the exercises of today, we will work with a single sample of this study. Download and unpack the data files in your home directory. cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. 
Answer Go to the ~/workdir/data folder: cd ~/workdir/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files The reads folder contains a fastq file with reads; the reference folder contains the reference sequence. 2. Quality control We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/workdir/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. 
Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%. 3. Read alignment The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the helper minimap2 --help and/or the github readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical parameter for our dataset to the option -x ? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. 
Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the top-left drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilobases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh 4. Visualisation Let\u2019s have a look at the alignments. 
Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"QC and alignment"},{"location":"course_material/qc_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain how the fastq format stores sequence and base quality information and why this is limited for long-read sequencing data Calculate base accuracy and probability based on base quality Describe how alignment information is stored in a sequence alignment ( .sam ) file Perform a quality control on long-read data with NanoPlot Perform a basic alignment of long reads with minimap2 Visualise an alignment file in IGV on a local computer","title":"Learning outcomes"},{"location":"course_material/qc_alignment/#material","text":"Download the presentation","title":"Material"},{"location":"course_material/qc_alignment/#exercises","text":"","title":"Exercises"},{"location":"course_material/qc_alignment/#1-retrieve-data","text":"We will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 The authors used full-transcript amplicon sequencing with Oxford Nanopore Technology of CACNA1C, a gene associated with psychiatric risk. For the exercises of today, we will work with a single sample of this study. Download and unpack the data files in your home directory. 
cd ~/workdir wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/ngs-longreads-training.tar.gz tar -xvf ngs-longreads-training.tar.gz rm ngs-longreads-training.tar.gz Exercise: This will create the directory data . Check out what\u2019s in there. Answer Go to the ~/workdir/data folder: cd ~/workdir/data The data folder contains the following: data/ \u251c\u2500\u2500 reads \u2502 \u2514\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2514\u2500\u2500 reference \u2514\u2500\u2500 Homo_sapiens.GRCh38.dna.chromosome.12.fa 2 directories, 2 files The reads folder contains a fastq file with reads; the reference folder contains the reference sequence.","title":"1. Retrieve data"},{"location":"course_material/qc_alignment/#2-quality-control","text":"We will evaluate the read quality with NanoPlot . Exercise: Check out the manual of NanoPlot with the command NanoPlot --help , and run NanoPlot on ~/workdir/data/reads/cerebellum-5238-batch2.fastq.gz . Hint For a basic output of NanoPlot on a fastq.gz file you can use the options --outdir and --fastq . 
Answer We have a fastq file, so based on the manual and the example we can run: cd ~/workdir NanoPlot \\ --fastq data/reads/cerebellum-5238-batch2.fastq.gz \\ --outdir nanoplot_output You will now have a directory with the following files: nanoplot_output \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_dot.png \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.html \u251c\u2500\u2500 LengthvsQualityScatterPlot_kde.png \u251c\u2500\u2500 NanoPlot-report.html \u251c\u2500\u2500 NanoPlot_20230309_1332.log \u251c\u2500\u2500 NanoStats.txt \u251c\u2500\u2500 Non_weightedHistogramReadlength.html \u251c\u2500\u2500 Non_weightedHistogramReadlength.png \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 Non_weightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 WeightedHistogramReadlength.html \u251c\u2500\u2500 WeightedHistogramReadlength.png \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.html \u251c\u2500\u2500 WeightedLogTransformed_HistogramReadlength.png \u251c\u2500\u2500 Yield_By_Length.html \u2514\u2500\u2500 Yield_By_Length.png The file NanoPlot-report.html contains a report with all the information stored in the other files. Exercise: Download NanoPlot-report.html to your local computer and answer the following questions: A. How many reads are in the file? B. What is the average read length? Is there a wide distribution? Given that these sequences are generated from a long-range PCR, is that expected? C. What is the average base quality and what kind of accuracy do we therefore expect? Download files from the notebook You can download files from the file browser, by right-clicking a file and selecting Download : Answer A. 3735 B. The average read length is 6,003.3 base pairs. From the read length histogram we can see that there is a very narrow distribution. As a PCR will generate sequences of approximately the same length, this is expected. C. 
The average base quality is 7.3. We have learned that \\(p=10^{\\frac{-baseQ}{10}}\\) , so the average probability that the base is wrong is \\(10^{\\frac{-7.3}{10}} = 0.186\\) . The expected accuracy is \\(1-0.186=0.814\\) or 81.4%.","title":"2. Quality control"},{"location":"course_material/qc_alignment/#3-read-alignment","text":"The sequence aligner minimap2 is specifically developed for (splice-aware) alignment of long reads. Exercise: Check out the help text with minimap2 --help and/or the GitHub readme . We are working with reads generated from cDNA. Considering we are aligning to a reference genome (DNA), what would be the most logical argument to the option -x for our dataset? Answer The option -x can take the following arguments: -x STR preset (always applied before other options; see minimap2.1 for details) [] - map-pb/map-ont: PacBio/Nanopore vs reference mapping - ava-pb/ava-ont: PacBio/Nanopore read overlap - asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence - splice: long-read spliced alignment - sr: genomic short-read mapping We are working with ONT data so we could choose map-ont . However, our data is also spliced. Therefore, we should choose splice . Introns can be quite long in mammals; up to a few hundred kb. Exercise: Look up the CACNA1C gene in hg38 in IGV, and roughly estimate the length of the longest intron. Hint First load hg38 in IGV, by clicking the top-left drop-down menu: After that type CACNA1C in the search box: Answer The longest intron is about 350 kilobases (350,000 base pairs) Exercise: Check out the -G option of minimap2 . How does this relate to the largest intron size of CACNA1C? Answer This is what the manual says: -G NUM max intron length (effective with -xsplice; changing -r) [200k] We found an intron size of approximately 350k, so the default is set too small. We should increase it to at least 350k. Exercise: Make a directory called alignments in your working directory. 
After that, modify the command below for minimap2 and run it from a script. #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam Note Once your script is running, it will take a while to finish. Have a \u2615. Answer Make a directory like this: mkdir ~/workdir/alignments Modify the script to set the -x and -G options: #!/usr/bin/env bash cd ~/workdir minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ data/reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa \\ data/reads/cerebellum-5238-batch2.fastq.gz \\ | samtools sort \\ | samtools view -bh > alignments/cerebellum-5238-batch2.bam ## indexing for IGV samtools index alignments/cerebellum-5238-batch2.bam And run it (e.g. if you named the script ont_alignment.sh ): chmod u+x ont_alignment.sh ./ont_alignment.sh","title":"3. Read alignment"},{"location":"course_material/qc_alignment/#4-visualisation","text":"Let\u2019s have a look at the alignments. Download the files cerebellum-5238-batch2.bam and cerebellum-5238-batch2.bam.bai to your local computer and load the .bam file into IGV ( File > Load from File\u2026 ). Exercise: Have a look at the region chr12:2,632,655-2,635,447 by typing it into the search box. Do you see any evidence for alternative splicing already? Answer The two exons seem to be mutually exclusive:","title":"4. Visualisation"},{"location":"course_material/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) in your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting: Material Instructions to install docker Instructions to set up to container Exercises First login Docker can be used to run an entire isolated environment in a container. 
This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB=yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888:8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory on your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. 
If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads . Activate it like so: conda activate ngs-longreads After successful installation and activating the environment all the software required to do the exercises should be available. A UNIX command line interface (CLI) refresher Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory. Make a new directory Log in to the server and use the command line to make a directory called workdir . If working with Docker If you are working with Docker, you are the root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. 
) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there is a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l Variables Store system_dirs.txt as a variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l Loops 20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. 
For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all three command calls. For that reason, among others, you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g. look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Server login"},{"location":"course_material/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. 
After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.789.10:10002 ) in your browser. This should result in the following page: Type your password, and proceed to the notebook home page. This page contains all the files in your working directory (if there are any). Most of the exercises will be executed through the command line. We use the terminal for this. Find it at New > Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. You can generate and edit scripts with New > Text File : Once you have opened a script you can change the code highlighting. This is convenient for writing the code. The text editor will automatically change the highlighting based on the file extension (e.g. .py extension will result in python syntax highlighting). You can change or set the syntax highlighting by clicking the button on the bottom of the page. 
We will be using mainly shell scripting in this course, so here\u2019s an example for adjusting it to shell syntax highlighting:","title":"Learning outcomes"},{"location":"course_material/server_login/#material","text":"Instructions to install docker Instructions to set up the container","title":"Material"},{"location":"course_material/server_login/#exercises","text":"","title":"Exercises"},{"location":"course_material/server_login/#first-login","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally on your computer, independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -e JUPYTER_ENABLE_LAB=yes \\ -v /path/to/workingdir/:/home/jovyan \\ -p 8888:8888 \\ geertvangeest/ngs-longreads-jupyter:latest \\ start-notebook.sh If this command has run successfully, you will find a link and token in the console, e.g.: http://127.0.0.1:8888/?token=4be8d916e89afad166923de5ce5th1s1san3xamp13 Copy this URL into your browser, and you will be able to use the jupyter notebook. The option -v mounts a local directory on your computer to the directory /home/jovyan in the docker container (\u2018jovyan\u2019 is the default user for jupyter containers). In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. 
Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-longreads-jupyter:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. You can build the environment from ngs-longreads.yml Generate the conda environment like this: conda env create --name ngs-longreads -f ngs-longreads.yml The yaml file probably only works for Linux systems If you want to use the conda environment on a different OS, use: conda create -n ngs-longreads python=3.6 conda activate ngs-longreads conda install -y -c bioconda \\ samtools \\ minimap2 \\ fastqc \\ pbmm2 conda install -y -c bioconda nanoplot If the installation of NanoPlot fails, try to install it with pip : pip install NanoPlot This will create the conda environment ngs-longreads . Activate it like so: conda activate ngs-longreads After successful installation and activating the environment all the software required to do the exercises should be available.","title":"First login"},{"location":"course_material/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software is UNIX-based and is executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge of UNIX. 
For this course, we need a basic understanding of the UNIX CLI, so here are some exercises to refresh your memory.","title":"A UNIX command line interface (CLI) refresher"},{"location":"course_material/server_login/#make-a-new-directory","text":"Log in to the server and use the command line to make a directory called workdir . 
If working with Docker If you are working with Docker, you are the root user. This means that your \u201chome\u201d directory is the root directory, i.e. /root , and not /home/username . If you have mounted your local directory to /root/workdir , this directory should already exist. Answer cd mkdir workdir Make a directory scripts within ~/workdir and make it your current directory. Answer cd workdir mkdir scripts cd scripts","title":"Make a new directory"},{"location":"course_material/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/workdir/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"course_material/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there is a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory ( ~/workdir ; use ls and > ). Answer ls / > ~/workdir/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. 
Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"course_material/server_login/#variables","text":"Store system_dirs.txt as a variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer cd ~/workdir FILE=system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"course_material/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"course_material/server_login/#loops","text":"20 minutes If you want to run the same command on a range of arguments, it\u2019s not very convenient to type the command for each individual argument. For example, you could write dog , fox , bird to stdout in a script like this: #!/usr/bin/env bash echo dog echo fox echo bird However, if you want to change the command (add an option for example), you would have to change it for all three command calls. For that reason, among others, you want to write the command only once. You can do this with a for-loop, like this: #!/usr/bin/env bash ANIMALS=\"dog fox bird\" for animal in $ANIMALS do echo $animal done Which results in: dog fox bird Write a shell script that removes all the letters \u201ce\u201d from a list of words. Hint Removing the letter \u201ce\u201d from a string can be done with tr like this: word=\"test\" echo $word | tr -d \"e\" Which would result in: tst Answer Your script should e.g. 
look like this (I\u2019ve added some awesome functionality): #!/usr/bin/env bash WORDLIST=\"here is a list of words resulting in a sentence\" for word in $WORDLIST do echo \"'$word' with e's removed looks like:\" echo $word | tr -d \"e\" done resulting in: 'here' with e's removed looks like: hr 'is' with e's removed looks like: is 'a' with e's removed looks like: a 'list' with e's removed looks like: list 'of' with e's removed looks like: of 'words' with e's removed looks like: words 'resulting' with e's removed looks like: rsulting 'in' with e's removed looks like: in 'a' with e's removed looks like: a 'sentence' with e's removed looks like: sntnc","title":"Loops"},{"location":"course_material/group_work/group_work/","text":"Learning outcomes After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline Introduction The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working independently If you are working independently, you probably can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of harddisk. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually. Roles & organisation Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. 
There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group. Working directories Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~ ln -s /group_work/ ./ Now you can find your group directory at ~/ . Use this as much as possible. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Introduction"},{"location":"course_material/group_work/group_work/#learning-outcomes","text":"After having completed this chapter you will be able to: Develop a basic pipeline for alignment-based analysis of a long-read sequencing dataset Answer biological questions based on the analysis resulting from the pipeline","title":"Learning outcomes"},{"location":"course_material/group_work/group_work/#introduction","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. 
If working independently If you are working independently, you probably cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 10G of hard disk space. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.","title":"Introduction"},{"location":"course_material/group_work/group_work/#roles-organisation","text":"Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. At the start of the project, make sure that each of you gets a task assigned. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will divide the initial tasks, and start on the project. On day 2, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"course_material/group_work/group_work/#working-directories","text":"Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~ ln -s /group_work/ ./ Now you can find your group directory at ~/ . 
Use this as much as possible. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Working directories"},{"location":"course_material/group_work/project1/","text":"Project 1: Differential isoform expression analysis of ONT data In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz Before you start You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis. Tasks & questions Perform QC with fastqc and with NanoPlot . 
Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. 
In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate productivity of the different isoforms.","title":"Project 1"},{"location":"course_material/group_work/project1/#project-1-differential-isoform-expression-analysis-of-ont-data","text":"In this project, you will be working with data from the same resource as the data we have already worked on: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . It is Oxford Nanopore Technology sequencing data of PCR amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. We will use the dataset to do that and in addition do a differential isoform expression analysis with FLAIR . Project aim Discover new splice variants and identify differentially expressed isoforms. You can download the required data like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
This will create a directory project1 with the following structure: project1/ \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.bam \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.bam \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.bam \u2502 \u2514\u2500\u2500 parietal_cortex-5346-batch1.bam \u251c\u2500\u2500 counts \u2502 \u2514\u2500\u2500 counts_matrix_test.tsv \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 cerebellum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5298-batch2.fastq.gz \u2502 \u251c\u2500\u2500 cerebellum-5346-batch2.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5238-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5298-batch1.fastq.gz \u2502 \u251c\u2500\u2500 parietal_cortex-5346-batch1.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5238-batch2.fastq.gz \u2502 \u251c\u2500\u2500 striatum-5298-batch2.fastq.gz \u2502 \u2514\u2500\u2500 striatum-5346-batch2.fastq.gz \u251c\u2500\u2500 reads_manifest.tsv \u2514\u2500\u2500 scripts \u2514\u2500\u2500 differential_expression_example.Rmd 4 directories, 18 files Download the fasta file and gtf like this: cd project1/ mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz gunzip *.gz","title":" Project 1: Differential isoform expression analysis of ONT data"},{"location":"course_material/group_work/project1/#before-you-start","text":"You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. 
Possible starting points are: Quality control, running fastqc and NanoPlot Alignment, running minimap2 Develop scripts required to run FLAIR Differential expression analysis.","title":"Before you start"},{"location":"course_material/group_work/project1/#tasks-questions","text":"Perform QC with fastqc and with NanoPlot . Do you see a difference between them? How is the read quality compared to the publication? Align each sample separately with minimap2 with default parameters. Set parameters -x and -G to the values we have used during the QC and alignment exercises . You can use 4 threads (set the number of threads with -t ) Start the alignment as soon as possible The alignment takes about 6 minutes per sample, so in total about one hour to run. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x splice \\ -d reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x splice \\ -G 500k \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.chromosome.12.fa.mmi \\ reads/ Have a look at the FLAIR documentation . FLAIR and all its dependencies are in the pre-installed conda environment named flair . You can activate it with conda activate flair . Merge the separate alignments with samtools merge , index the merged bam file, and generate a bed12 file with the command bam2Bed12 Run flair correct on the bed12 file. Add the gtf to the options to improve the alignments. Run flair collapse to generate isoforms from corrected reads. This step takes ~1.5 hours to run. Generate a count matrix with flair quantify by using the isoforms fasta and reads_manifest.tsv (takes ~45 mins to run). Paths in reads_manifest.tsv The paths in reads_manifest.tsv are relative, e.g. 
reads/striatum-5238-batch2.fastq.gz points to a file relative to the directory from which you are running flair quantify . So the directory from which you are running the command should contain the directory reads . If not, modify the paths in the file accordingly (use full paths if you are not sure). Now you can do several things: Do a differential expression analysis. In scripts/ there\u2019s a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is rstudio ). Investigate the isoform usage with the flair script plot_isoform_usage.py Investigate productivity of the different isoforms.","title":"Tasks & questions"},{"location":"course_material/group_work/project2/","text":"Project 2: Repeat expansion analysis of PacBio data You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using Pacbio\u2019s no-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/groupwork_pacbio.tar.gz tar -xvf groupwork_pacbio.tar.gz rm groupwork_pacbio.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
It has the following directory structure: groupwork_pacbio \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 bc1020.aln.bam \u2502 \u251c\u2500\u2500 bc1021.aln.bam \u2502 \u2514\u2500\u2500 bc1022.aln.bam \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u2514\u2500\u2500 targets \u251c\u2500\u2500 target_gene1_hg38.bed \u2514\u2500\u2500 target_gene2_hg38.bed 3 directories, 13 files The targets in gene1 and gene2 are described in targets/target_gene1_hg38.bed and targets/target_gene2_hg38.bed respectively. The columns in these .bed files describe the chromosome, start, end, name, motifs, and whether the motifs are in reverse complement. You can download the reference genome like this: cd groupwork_pacbio mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Before you start You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Browse IGV to find the genes Perform the QC with NanoPlot Perform the alignment with minimap2 Do the repeat analysis with makeReports.sh Alignment files to do an initial repeat analysis are in the tar.gz package. However, it contains only the files for individuals with disease2. You can develop scripts and analyses based on that. To do the full analysis, all the alignments will need to be run. Tasks & questions Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? 
What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to hg38 with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Start the alignment as soon as possible The alignment takes quite some time. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x asm20 \\ -d reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x asm20 \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reads/ Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Clone the PacBio apps-scripts repository to the server. All dependencies are in the conda environment pacbio . Activate it with conda activate pacbio . The script apps-scripts/RepeatAnalysisTools/makeReports.sh generates repeat expansion reports. Check out the documentation , and generate repeat expansion reports for all individuals on both gene1 and gene2. Check out the report output and read the further documentation of RepeatAnalysisTools . How is the enrichment? Does the clustering make sense? How does the clustering look in IGV? Which individual is affected with which disease? Based on the size of the expansions, can you say something about expected disease severity? 
This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Project 2"},{"location":"course_material/group_work/project2/#project-2-repeat-expansion-analysis-of-pacbio-data","text":"You will be working with data from an experiment in which DNA of 8 individuals was sequenced for five different targets by using Pacbio\u2019s no-Amp targeted sequencing system. Two of these targets contain repeat expansions that are related to a disease phenotype. Project aim Estimate variation in repeat expansions in two target regions, and relate them to a disease phenotype. individual disease1 disease2 1015 disease healthy 1016 disease healthy 1017 disease healthy 1018 disease healthy 1019 healthy healthy 1020 healthy disease 1021 healthy disease 1022 healthy disease You can get the reads and sequence targets with: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/groupwork_pacbio.tar.gz tar -xvf groupwork_pacbio.tar.gz rm groupwork_pacbio.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. 
It has the following directory structure: groupwork_pacbio \u251c\u2500\u2500 alignments \u2502 \u251c\u2500\u2500 bc1020.aln.bam \u2502 \u251c\u2500\u2500 bc1021.aln.bam \u2502 \u2514\u2500\u2500 bc1022.aln.bam \u251c\u2500\u2500 reads \u2502 \u251c\u2500\u2500 1015.fastq.gz \u2502 \u251c\u2500\u2500 1016.fastq.gz \u2502 \u251c\u2500\u2500 1017.fastq.gz \u2502 \u251c\u2500\u2500 1018.fastq.gz \u2502 \u251c\u2500\u2500 1019.fastq.gz \u2502 \u251c\u2500\u2500 1020.fastq.gz \u2502 \u251c\u2500\u2500 1021.fastq.gz \u2502 \u2514\u2500\u2500 1022.fastq.gz \u2514\u2500\u2500 targets \u251c\u2500\u2500 target_gene1_hg38.bed \u2514\u2500\u2500 target_gene2_hg38.bed 3 directories, 13 files The targets in gene1 and gene2 are described in targets/target_gene1_hg38.bed and targets/target_gene2_hg38.bed respectively. The columns in these .bed files describe the chromosome, start, end, name, motifs, and whether the motifs are in reverse complement. You can download the reference genome like this: cd groupwork_pacbio mkdir reference cd reference wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz","title":" Project 2: Repeat expansion analysis of PacBio data"},{"location":"course_material/group_work/project2/#before-you-start","text":"You can start this project with dividing initial tasks. Because some intermediate files are already given, participants can develop scripts/analyses at different steps of the full analysis from the start. Possible starting points are: Browse IGV to find the genes Perform the QC with NanoPlot Perform the alignment with minimap2 Do the repeat analysis with makeReports.sh Alignment files to do an initial repeat analysis are in the tar.gz package. However, it contains only the files for individuals with disease2. You can develop scripts and analyses based on that. 
To do the full analysis, all the alignments will need to be run.","title":"Before you start"},{"location":"course_material/group_work/project2/#tasks-questions","text":"Load the bed files into IGV and navigate to the regions they annotate. In which genes are the targets? What kind of diseases are associated with these genes? Perform a quality control with NanoPlot . How is the read quality? These are circular consensus sequences (ccs). Is this quality expected? How is the read length? Align the reads to hg38 with minimap2 . For the option -x you can use asm20 . Generate separate alignment files for each individual. Start the alignment as soon as possible The alignment takes quite some time. Try to start the alignment as soon as possible. You can speed up your alignment by first making an index, e.g.: minimap2 \\ -x asm20 \\ -d reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa Refer to the generated index ( .mmi file) as reference in the alignment command, e.g.: minimap2 \\ -a \\ -x asm20 \\ -t 4 \\ reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmi \\ reads/ Alternatively use pbmm2 Pacific Biosciences has developed a wrapper for minimap2 that contains settings specific for PacBio reads, named pbmm2 . It might slightly improve your alignments. It is installed in the conda environment. Feel free to give it a try if you have time left. Clone the PacBio apps-scripts repository to the server. All dependencies are in the conda environment pacbio . Activate it with conda activate pacbio . The script apps-scripts/RepeatAnalysisTools/makeReports.sh generates repeat expansion reports. Check out the documentation , and generate repeat expansion reports for all individuals on both gene1 and gene2. Check out the report output and read the further documentation of RepeatAnalysisTools . How is the enrichment? Does the clustering make sense? How does the clustering look in IGV? 
Which individual is affected with which disease? Based on the size of the expansions, can you say something about expected disease severity? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/RepeatExpansionDisorders_NoAmp/","title":"Tasks & questions"},{"location":"course_material/group_work/project3/","text":"Project 3: Assembly and annotation of bacterial genomes You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files Before you start You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka Tasks and questions Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . 
How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the helper first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at for example the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Project 3"},{"location":"course_material/group_work/project3/#project-3-assembly-and-annotation-of-bacterial-genomes","text":"You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. 
Project aim Generate and evaluate an assembly of a bacterial genome out of PacBio reads. There are eight different species: sample_[1-8].fastq.gz Each species has a fastq file available. You can download all fastq files like this: wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Note Download the data file package in your shared working directory, i.e. : /group_work/ or ~/ . Only one group member has to do this. This will create a directory project3 with the following structure: project3 |-- sample_1.fastq.gz |-- sample_2.fastq.gz |-- sample_3.fastq.gz |-- sample_4.fastq.gz |-- sample_5.fastq.gz |-- sample_6.fastq.gz |-- sample_7.fastq.gz `-- sample_8.fastq.gz 0 directories, 8 files","title":" Project 3: Assembly and annotation of bacterial genomes"},{"location":"course_material/group_work/project3/#before-you-start","text":"You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation: Quality control with NanoPlot Assembly with flye Assembly QC with BUSCO Annotation with prokka","title":"Before you start"},{"location":"course_material/group_work/project3/#tasks-and-questions","text":"Note You have four cores available. Use them! For most tools you can specify the number of cores/cpus as an argument. Note All required software can be found in the conda environment assembly . Load it like this: conda activate assembly Perform a quality control with NanoPlot . How is the read quality? Is this quality expected? How is the read length? Perform an assembly with flye . Have a look at the helper first with flye --help . Make sure you pick the correct mode (i.e. --pacbio-?? ). Check out the output. Where is the assembly? How is the quality? For that, check out assembly_info.txt . What species did you assemble? 
Choose from this list: Acinetobacter baumannii Bacillus cereus Bacillus subtilis Burkholderia cepacia Burkholderia multivorans Enterococcus faecalis Escherichia coli Helicobacter pylori Klebsiella pneumoniae Listeria monocytogenes Methanocorpusculum labreanum Neisseria meningitidis Rhodopseudomonas palustris Salmonella enterica Staphylococcus aureus Streptococcus pyogenes Thermanaerovibrio acidaminovorans Treponema denticola Vibrio parahaemolyticus Did flye assemble any plasmid sequences? Check the completeness with BUSCO . Have a good look at the manual first. You can use automated lineage selection by specifying --auto-lineage-prok . After you have run BUSCO , you can generate a nice completeness plot with generate_plot.py . You can check its usage with generate_plot.py --help . How is the completeness? Is this expected? Perform an annotation with prokka . Again, check the manual first. After the run, have a look at for example the statistics in PROKKA_[date].txt . For a nice table of annotated genes have a look in PROKKA_[date].tsv . Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why? This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/","title":"Tasks and questions"}]} \ No newline at end of file diff --git a/2023.3/sitemap.xml.gz b/2023.3/sitemap.xml.gz index 777929d239cf31a49794fad27338c541fde98d3a..9ecac78f93df13c2c34d35a9da8e6b1fdd05611c 100644 GIT binary patch delta 14 VcmX@Zc!rTpzMF$1A#ftwF#sT%1Ze;O delta 14 VcmX@Zc!rTpzMF&NxZ^~&V*nyu1i%0Q