ref_based_rna-seq.yaml
---
id: ref-based
name: Reference-based RNA-Seq data analysis
description: >-
In this tutorial we will align reads against a reference genome, Drosophila
melanogaster, to significantly improve the ability to reconstruct transcripts
and then identify differences of expression between several conditions.
title_default: peptide-protein-id
tags:
- "RNA"
steps:
- title: Introduction
content: >-
In this tutorial we will align reads against a reference genome, Drosophila
melanogaster, to significantly improve the ability to reconstruct
transcripts and then identify differences of expression between several
conditions.
backdrop: true
- title: Introduction
content: >-
In the study of <a
href="http://genome.cshlp.org/content/21/2/193.long">Brooks et al.
2011</a>, the <i>Pasilla (PS)</i> gene, the <i>Drosophila</i> homologue of
the human splicing regulator proteins Nova-1 and Nova-2, was depleted in
<i>Drosophila melanogaster</i> by RNAi. The authors wanted to identify
exons that are regulated by the <i>Pasilla</i> gene using RNA sequencing
data.
backdrop: true
- title: Introduction
content: >-
Total RNA was isolated and used for preparing either single-end or
paired-end RNA-seq libraries for treated (PS depleted) samples and
untreated samples. These libraries were sequenced to obtain a collection
of RNA sequencing reads for each sample. The effects of <i>Pasilla</i>
gene depletion on splicing events can then be analyzed by comparison of
RNA sequencing data of the treated (PS depleted) and the untreated
samples. <br><br>The genome of <i>Drosophila melanogaster</i> is known and
assembled. It can be used as a reference genome to ease this analysis. In a
reference-based RNA-seq data analysis, the reads are aligned (or mapped)
against a reference genome, <i>Drosophila melanogaster</i> here, to
significantly improve the ability to reconstruct transcripts and then
identify differences of expression between several conditions.
backdrop: true
- title: Data upload
content: >-
The original data is available at NCBI Gene Expression Omnibus (GEO) under
accession number <a
href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508">GSE18508</a>.
<br><br>We will look at the first 7 samples:
<ul>
<li>3 treated samples with <i>Pasilla</i> (PS) gene depletion: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461179">GSM461179</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461180">GSM461180</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461181">GSM461181</a></li>
<li>4 untreated samples: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461176">GSM461176</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461177">GSM461177</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461178">GSM461178</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461182">GSM461182</a></li>
</ul>
<br>Each sample constitutes a separate biological replicate of the
corresponding condition (treated or untreated). Moreover, two of the
treated and two of the untreated samples are from a paired-end sequencing
assay, while the remaining samples are from a single-end sequencing
experiment.<br><br> We have extracted sequences from the Sequence Read
Archive (SRA) files to build FASTQ files.
backdrop: true
- title: History options
element: '#history-options-button'
content: >-
Create a new history for this RNA-seq exercise. Click on this button and
then "Create New"
placement: left
- title: Importing data via links
content: >-
Import files from <a href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a>.
backdrop: true
- title: Uploading the new data
element: '#tool-panel-upload-button .fa.fa-upload'
content: We need to upload data. Open the Galaxy Upload Manager
placement: right
postclick:
- '#tool-panel-upload-button .fa.fa-upload'
- '#btn-reset'
- title: Uploading the input data
element: '#btn-new'
content: Click on Paste/Fetch Data
placement: right
postclick:
- '#btn-new'
- title: Uploading the input data
element: .upload-text-column .upload-text .upload-text-content.form-control
content: Load the data into your history by providing the links
placement: right
textinsert: |-
https://zenodo.org/record/1185122/files/GSM461177_1.fastqsanger
https://zenodo.org/record/1185122/files/GSM461177_2.fastqsanger
https://zenodo.org/record/1185122/files/GSM461180_1.fastqsanger
https://zenodo.org/record/1185122/files/GSM461180_2.fastqsanger
backdrop: false
- title: Uploading the input data
element: '#btn-start'
content: Click on "Start" to start loading the data to history
placement: right
postclick:
- '#btn-start'
- title: Uploading the input data
element: '#btn-close'
content: >-
The upload may take a while.<br> Hit the close button to close this
window.
placement: right
postclick:
- '#btn-close'
- title: Rename the input data
element: '.history-right-panel .list-items > *:first'
content: >-
The uploaded datasets are in the history, but their names correspond to
the links. We want to rename them to something more meaningful.<br><br> <ul>
<li>Click on the pencil icon beside the file to "Edit Attributes".</li>
<li>Change the "<b>Name:</b>" accordingly.</li>
<li>Make sure "<b>datatype</b>" is set to "fastqsanger".</li>
</ul>
position: left
- title: Adding a tag
element: '.history-right-panel .list-items > *:first'
content: >-
Add to each dataset a tag corresponding to the name of the sample
(`#GSM461177` or `#GSM461180`):
<ul>
<li>Click on the dataset</li>
<li>Click on <b>Edit dataset tags</b></li>
<li>Add the tag starting with `#`</li>
</ul>
position: left
- title: Quality control
content: >-
The sequences are raw data from the sequencing machine, without any
pretreatments. They need to be assessed for their quality.<br><br>
For quality control, we use tools similar to those described in the <a
href="http://galaxyproject.github.io/training-material/topics/sequence-analysis">NGS-QC
tutorial</a>: <a
href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a>
and <a
href="https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/">Trim
Galore</a>.
backdrop: true
- title: Quality control
element: '#tool-search-query'
content: Search for 'FastQC' tool.
placement: right
textinsert: FastQC
- title: Quality control
element: '#tool-search'
content: Click on the 'FastQC' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fdevteam%2Ffastqc%2Ffastqc%2F0.69"]
.tool-old-link
- title: Quality control
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Short read data from your current history" to `Multiple datasets`</li>
</ul>
position: right
- title: Quality control
element: '.history-right-panel .list-items > *:first'
content: Inspect the generated webpage for the GSM461177_1 sample.
position: left
- title: Questions
content: |-
<ul>
<li>What is the read length?</li>
</ul>
backdrop: false
- title: Quality control
element: '#tool-search-query'
content: Search for 'MultiQC' tool.
placement: right
textinsert: MultiQC
- title: Quality control
element: '#tool-search'
content: Click on the 'MultiQC' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
- title: Quality control
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Which tool was used generate logs?" to `FastQC`</li>
<li>"Type of FastQC output?" to `Raw data`</li>
<li>"FastQC output" to the generated Raw data files (multiple datasets)</li>
</ul>
position: right
- title: Quality control
element: '.history-right-panel .list-items > *:first'
content: Inspect the webpage output from MultiQC
position: left
- title: Questions
content: |-
<ul>
<li>What is the quality of the sequences in the different files?</li>
</ul>
backdrop: false
- title: Quality control
element: '#tool-search-query'
content: Search for 'Trim Galore' tool.
placement: right
textinsert: Trim Galore
- title: Quality control
element: '#tool-search'
content: Click on the 'Trim Galore' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftrim_galore%2Ftrim_galore%2F0.4.3.1"]
.tool-old-link
- title: Quality control
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Is this library paired- or single-end?" to `Paired-end`</li>
<li>First "Reads in FASTQ format" to both `_1` fastqsanger datasets (multiple datasets)</li>
<li>Second "Reads in FASTQ format" to both `_2` fastqsanger datasets (multiple datasets)</li>
</ul>
position: right
- title: Questions
content: |-
<ul>
<li>Why do we run Trim Galore! only once on a paired-end dataset and not twice, once for each dataset?</li>
</ul>
backdrop: false
- title: Mapping
content: >-
As the genome of <i>Drosophila melanogaster</i> is known and assembled, we
can use this information and map the sequences on this genome to identify
the effects of <i>Pasilla</i> gene depletion on splicing events.<br><br>
To make sense of the reads, we need to determine to which genes
they belong. The first step is to determine their positions within the
<i>Drosophila melanogaster</i> genome. This process is known as aligning
or ‘mapping’ the reads to a reference.<br><br>
In the case of a eukaryotic transcriptome, most reads originate from
processed mRNAs lacking introns, so they cannot simply be mapped back to
the genome as we normally do for DNA data. Instead, the reads must be
separated into two categories:
<ul>
<li>Reads that map entirely within exons</li>
<li>Reads that cannot be mapped within an exon across their entire length because they span two or more exons</li>
</ul>
backdrop: true
- title: Mapping
element: '#tool-search-query'
content: Search for 'RNA STAR' tool.
placement: right
textinsert: RNA STAR
- title: Mapping
element: '#tool-search'
content: Click on the 'RNA STAR' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Frgrnastar%2Frna_star%2F2.5.2b-0"]
.tool-old-link
- title: Mapping
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Single-end or paired-end reads" to `Paired-end (as individual datasets)`</li>
<li>"RNA-Seq FASTQ/FASTA file, forward reads" to the generated `trimmed reads pair 1` files (multiple datasets)</li>
<li>"RNA-Seq FASTQ/FASTA file, reverse reads" to the generated `trimmed reads pair 2` files (multiple datasets)</li>
<li>"Custom or built-in reference genome" to `Use a built-in index`</li>
<li>"Reference genome with or without an annotation" to `use genome reference without builtin gene-model`</li>
<li>"Select reference genome" to `Drosophila Melanogaster (dm6)`</li>
<li>"Gene model (gff3,gtf) file for splice junctions" to the imported `Drosophila_melanogaster.BDGP6.87.gtf`</li>
<li>"Length of the genomic sequence around annotated junctions" to `36`</li>
</ul>
position: right
- title: Mapping
element: '#tool-search-query'
content: Search for 'MultiQC' tool.
placement: right
textinsert: MultiQC
- title: Mapping
element: '#tool-search'
content: Click on the 'MultiQC' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
.tool-old-link
- title: Mapping
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Which tool was used generate logs?" to `STAR`</li>
<li>"Type of STAR output?" to `Log`</li>
<li>"STAR log output" to the generated `log` files (multiple datasets)</li>
</ul>
position: right
- title: Questions
content: |-
<ul>
<li>What percentage of reads was mapped exactly once in each of the two samples?</li>
<li>What is a BAM file?</li>
<li>What does such a file contain?</li>
</ul>
backdrop: false
- title: Inspection of the mapping results with IGV
content: >-
The BAM file contains information about where the reads are mapped on the
reference genome. However, it is a binary file, and with the information
for more than 3 million reads encoded in it, it is difficult to inspect and
explore.
<br><br>A powerful tool to visualize the content of BAM files is the
Integrative Genomics Viewer (IGV).
backdrop: true
- title: Inspection of the mapping results with IGV
element: '.history-right-panel .list-items > *:first'
content: |-
Visualize the aligned reads for `GSM461177`:
<ul>
<li>Click on the STAR BAM output in your history to expand it.</li>
<li>Towards the bottom of the history item, find the line starting with `Display with IGV`</li>
</ul>
position: left
- title: Inspection of the mapping results with IGV
content: 'Zoom to `chr4:540,000-560,000` (Chromosome 4 between 540 kb and 560 kb)'
backdrop: false
- title: Questions
content: |-
<ul>
<li>What information appears at the top in grey?</li>
<li>What do the connecting lines between some of the aligned reads indicate?</li>
</ul>
backdrop: false
- title: Creation of a Sashimi plot
content: |-
<ul>
<li>Right click on the BAM file</li>
<li>Select Sashimi Plot from the context menu</li>
</ul>
backdrop: false
- title: Questions
content: |-
<ul>
<li>What does the vertical bar graph represent? And the numbered arcs?</li>
<li>What do the numbers on the arcs mean?</li>
<li>Why do we observe different stacked groups of blue linked boxes at the bottom?</li>
</ul>
backdrop: false
- title: Aftermath
content: >-
After the mapping, we have the information on where the reads are located
on the reference genome. We also know how well they were mapped.<br><br>
The next step in the RNA-Seq data analysis is the quantification of the
expression level of the genomic features (genes, transcripts, exons, …) so
that several samples can then be compared in the differential expression
analysis. The quantification consists of taking each known genomic feature
(e.g. gene) of the reference genome and then counting how many reads are
mapped to this genomic feature. So, in this step, we start with information
per mapped read and end with information per genomic feature.
<br><br>To identify exons that are regulated by the <i>Pasilla</i> gene, we
need to identify genes and exons which are differentially expressed
between samples with PS gene depletion and control samples. In this
tutorial, we will therefore analyze the differential gene expression, but
also the differential exon usage.
backdrop: true
- title: Aftermath
content: >-
We will first investigate the differential gene expression to identify
which genes are impacted by the <i>Pasilla</i> gene depletion.
<br><br>To compare the expression of single genes between different
conditions (e.g. with or without PS depletion), an essential first step is
to quantify the number of reads per gene.<br><br>
Two main tools can be used for that: <a
href='http://htseq.readthedocs.io/en/release_0.9.1/count.html'>HTSeq-count</a>
(<a
href='https://academic.oup.com/bioinformatics/article/31/2/166/2366196'>Anders
et al, Bioinformatics, 2015</a>) or featureCounts (<a
href='https://doi.org/10.1093/bioinformatics/btt656'>Liao
et al, Bioinformatics, 2014</a>). The second one is considerably faster
and requires far fewer computational resources, so we will use it.
backdrop: true
- title: Estimation of the strandedness
content: >-
RNAs that are typically targeted in RNAseq experiments are single stranded
(e.g., mRNAs) and thus have polarity (5’ and 3’ ends that are functionally
distinct).
<br><br>During a typical RNAseq experiment, the information about
strandedness is lost after both strands of cDNA are synthesized, size
selected, and converted into a sequencing library. However, this information
can be quite useful for read counting.<br><br>
Some library preparation protocols create so-called <i>stranded</i> RNAseq
libraries that preserve the strand information (an excellent overview in
<a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3005310/'>Levin et
al, Nat Meth, 2010</a>). The implication of stranded RNAseq is that you
can distinguish whether the reads are derived from forward- or
reverse-encoded transcripts. Depending on the approach and whether one
performs single- or paired-end sequencing there are multiple possibilities
on how to interpret the results of mapping of these reads onto the genome.
backdrop: true
- title: Estimation of the strandedness
content: >-
In practice, with Illumina paired-end RNAseq protocols, you are unlikely
to uncover many of these possibilities. You will either deal with:
<ul>
<li>Unstranded RNAseq data</li>
<li>Stranded RNAseq data produced with Illumina TruSeq RNAseq kits and <a href='https://nar.oxfordjournals.org/content/37/18/e123'>dUTP tagging</a> (<b>ISR</b>)</li>
</ul>
This information should usually come with your FASTQ files; if it does
not, ask your sequencing facility, or try to find it on the site where you
downloaded the data or in the corresponding publication.<br>
Another option is to estimate these parameters with a tool called <b>Infer
Experiment</b>. This tool takes the output of your mappings (BAM files),
takes a subsample of your reads and compares their genome coordinates and
strands with those of the reference gene model (from an annotation file).
Based on the strand of the genes, it can gauge whether sequencing is
strand-specific, and if so, how reads are stranded.
backdrop: true
- title: Determining the library strandedness
element: '#tool-search-query'
content: Search for 'Infer Experiment' tool.
placement: right
textinsert: Infer Experiment
- title: Determining the library strandedness
element: '#tool-search'
content: Click on the 'Infer Experiment' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fnilesh%2Frseqc%2Frseqc_infer_experiment%2F2.6.4"]
.tool-old-link
- title: Determining the library strandedness
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Input .bam file" to the STAR-generated `BAM` files (multiple
datasets)</li>
<li>"Reference gene model" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
<li>"Number of reads sampled from SAM/BAM file (default = 200000)" to `200000`</li>
</ul>
position: right
- title: The output
element: '.history-right-panel .list-items > *:first'
content: >-
The tool generates one file with:
<ul>
<li>Paired-end or single-end library</li>
<li>Fraction of reads failed to determine</li>
<li>2 lines
<ul>
<li>For single-end
<ul>
<li>Fraction of reads explained by "++,--"</li>
<li>Fraction of reads explained by "+-,-+"</li>
</ul>
</li>
<li>For paired-end
<ul>
<li>Fraction of reads explained by "1++,1--,2+-,2-+"</li>
<li>Fraction of reads explained by "1+-,1-+,2++,2--"</li>
</ul>
</li>
</ul>
</li>
</ul>
If the fractions in the last two lines are close to each other, we
conclude that the library is not strand-specific (U in previous figure).
position: left
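# Reading the two fractions reported by Infer Experiment for a paired-end
# library. A sketch with hypothetical values (illustrative assumptions only,
# not results for the samples used in this tour):
#
#   Fraction of reads explained by "1++,1--,2+-,2-+": 0.48
#   Fraction of reads explained by "1+-,1-+,2++,2--": 0.49
#     -> both fractions near 0.5: unstranded library
#
#   Fraction of reads explained by "1++,1--,2+-,2-+": 0.95
#   Fraction of reads explained by "1+-,1-+,2++,2--": 0.03
#     -> read 1 maps to the same strand as the gene: forward-stranded library
#
#   Fraction of reads explained by "1++,1--,2+-,2-+": 0.03
#   Fraction of reads explained by "1+-,1-+,2++,2--": 0.95
#     -> read 1 maps to the strand opposite the gene: reverse-stranded
#        library (as produced, for example, by dUTP-based protocols)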
- title: Questions
content: |-
<ul>
<li>Which fraction of the reads in the BAM file can be explained assuming which library type for `GSM461177`?</li>
<li>Which library type do you choose for both samples?</li>
</ul>
backdrop: false
- title: Counting
content: >-
We now run <b>featureCounts</b> to count the number of reads per annotated
gene.
backdrop: true
- title: Counting the number of reads per annotated gene
element: '#tool-search-query'
content: Search for 'featureCounts' tool.
placement: right
textinsert: featureCounts
- title: Counting the number of reads per annotated gene
element: '#tool-search'
content: Click on the 'featureCounts' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffeaturecounts%2Ffeaturecounts%2F1.6.0.3"]
.tool-old-link
- title: Counting the number of reads per annotated gene
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Alignment file" to the STAR-generated `BAM` files (multiple datasets)</li>
<li>"Gene annotation file" to `GTF file`</li>
<li>"Gene annotation file" to `in your history`</li>
<li>"Gene annotation file" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
<li>"Output format" to `Gene-ID "\t" read-count (DESeq2 IUC wrapper compatible)`</li>
<li>Click on "Advanced options"</li>
<li>"GFF feature type filter" to `exon`</li>
<li>"GFF gene identifier" to `gene_id`</li>
<li>"Allow read to contribute to multiple features" to `No`</li>
<li>"Strand specificity of the protocol" to `Unstranded`</li>
<li>"Count multi-mapping reads/fragments" to `Disabled; multi-mapping reads are excluded (default)`</li>
<li>"Minimum mapping quality per read" to `10`</li>
</ul>
position: right
- title: Counting the number of reads per annotated gene
element: '#tool-search-query'
content: Search for 'MultiQC' tool.
placement: right
textinsert: MultiQC
- title: Counting the number of reads per annotated gene
element: '#tool-search'
content: Click on the 'MultiQC' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
.tool-old-link
- title: Counting the number of reads per annotated gene
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Which tool was used generate logs?" to `featureCounts`</li>
<li>"Output of FeatureCounts" to the generated `summary` files (multiple
datasets)</li>
</ul>
position: right
- title: Questions
content: |-
<ul>
<li>How many reads have been assigned to a gene?</li>
</ul>
backdrop: false
- title: The output
element: '.history-right-panel .list-items > *:first'
content: The main output of <b>featureCounts</b> is a big table.
position: left
- title: Questions
content: |-
<ul>
<li>What information do the generated table files contain?</li>
<li>Which feature has the most reads mapped on it for both samples?</li>
</ul>
backdrop: false
- title: Identification of the differentially expressed features
content: >-
So far we counted reads that mapped to genes for two samples. To be able to
identify differential gene expression induced by PS depletion, all
datasets (3 treated and 4 untreated) must be analyzed following the same
procedure and for the whole genome.
backdrop: true
- title: Identification of the differentially expressed features
content: >-
To save time, we have run the necessary steps for you and obtained 7 count
files, available on <a
href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a>.
<br><br>These files contain for each gene of Drosophila the number of
reads mapped to it. We could compare the files directly and calculate the
extent of differential gene expression, but the number of sequenced reads
mapped to a gene depends on:
<ul>
<li>Its own expression level</li>
<li>Its length</li>
<li>The sequencing depth of the sample</li>
<li>The expression of all other genes within the sample</li>
</ul>
backdrop: true
- title: Identification of the differentially expressed features
content: >-
For both within-sample and between-sample comparisons, the gene counts need
to be normalized. We can then use the Differential Gene Expression (DGE)
analysis, whose two basic tasks are:
<ul>
<li>Estimate the biological variance using the replicates for each condition</li>
<li>Estimate the significance of expression differences between any two conditions</li>
</ul>
The analysis works from read counts and corrects for variability in the
measurements using replicates, which are absolutely essential for accurate
results. For your own analysis, we advise you to use at least 3, but
preferably 5, biological replicates per condition. The number of replicates
may differ between conditions.
backdrop: true
- title: Identification of the differentially expressed features
content: >-
<a
href="https://bioconductor.org/packages/release/bioc/html/DESeq2.html">DESeq2</a>
is a great tool for DGE analysis. It takes the read counts produced
previously, combines them into a big table (with genes in the rows and
samples in the columns) and applies size factor normalization:
<ul>
<li>Computation for each gene of the geometric mean of read counts across all samples</li>
<li>Division of every gene count by the geometric mean of that gene</li>
<li>Use of the median of these ratios as a sample’s size factor for normalization</li>
</ul>
Multiple factors with several levels can then be incorporated in the
analysis. After normalization we can compare, in a statistically reliable
way, the response of the expression of any gene to the presence of
different levels of a factor.
backdrop: true
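# Worked example of the size factor normalization described in the step
# above. An illustrative sketch with made-up counts (three hypothetical genes
# and two hypothetical samples), not values from the Pasilla dataset:
#
#   gene  sample1  sample2  geometric mean             ratio1  ratio2
#   A     100      200      sqrt(100 * 200) = 141.42   0.7071  1.4142
#   B      50      100      sqrt( 50 * 100) =  70.71   0.7071  1.4142
#   C      10       40      sqrt( 10 *  40) =  20.00   0.5000  2.0000
#
#   size factor of sample1 = median(0.7071, 0.7071, 0.5000) = 0.7071
#   size factor of sample2 = median(1.4142, 1.4142, 2.0000) = 1.4142
#
# Dividing each count by the size factor of its sample gives the normalized
# counts. For gene A: 100 / 0.7071 = 141.42 and 200 / 1.4142 = 141.42, so
# after normalization gene A no longer differs between the two samples, while
# gene C (10 / 0.7071 = 14.14 versus 40 / 1.4142 = 28.28) still does.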
- title: Identification of the differentially expressed features
content: >-
In our example, we have samples with two varying factors that can explain
differences in gene expression:
<ul>
<li>Treatment (either treated or untreated)</li>
<li>Sequencing type (paired-end or single-end)</li>
</ul>
Here treatment is the primary factor which we are interested in. The
sequencing type is some further information that we know about the data
that might affect the analysis. This particular multi-factor analysis
allows us to assess the effect of the treatment, while taking the
sequencing type into account, too.
backdrop: true
- title: Data upload
content: >-
Import the seven count files from <a
href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a> or the data
library.
backdrop: true
- title: Uploading the new data
element: '#tool-panel-upload-button .fa.fa-upload'
content: We need to upload data. Open the Galaxy Upload Manager
placement: right
postclick:
- '#tool-panel-upload-button .fa.fa-upload'
- '#btn-reset'
- title: Uploading the input data
element: '#btn-new'
content: Click on Paste/Fetch Data
placement: right
postclick:
- '#btn-new'
- title: Uploading the input data
element: .upload-text-column .upload-text .upload-text-content.form-control
content: Load the data into your history by providing the links
placement: right
textinsert: |-
https://zenodo.org/record/1185122/files/GSM461176_untreat_single.counts
https://zenodo.org/record/1185122/files/GSM461177_untreat_paired.counts
https://zenodo.org/record/1185122/files/GSM461178_untreat_paired.counts
https://zenodo.org/record/1185122/files/GSM461179_treat_single.counts
https://zenodo.org/record/1185122/files/GSM461180_treat_paired.counts
https://zenodo.org/record/1185122/files/GSM461181_treat_paired.counts
https://zenodo.org/record/1185122/files/GSM461182_untreat_single.counts
backdrop: false
- title: Uploading the input data
element: '#btn-start'
content: Click on "Start" to start loading the data to history
placement: right
postclick:
- '#btn-start'
- title: Uploading the input data
element: '#btn-close'
content: >-
The upload may take a while.<br> Hit the close button to close this
window.
placement: right
postclick:
- '#btn-close'
- title: Determines differentially expressed features
element: '#tool-search-query'
content: Search for 'DESeq2' tool.
placement: right
textinsert: DESeq2
- title: Determines differentially expressed features
element: '#tool-search'
content: Click on the 'DESeq2' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fdeseq2%2Fdeseq2%2F2.11.40.1"]
.tool-old-link
- title: Determines differentially expressed features 1/2
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>For "1: Factor"<ul>
<li>"Specify a factor name" to `Treatment`</li>
<li>"1: Factor level"
<ul>
<li>"Specify a factor level" to `treated`</li>
<li>"Counts file(s)" to the 3 gene count files (multiple datasets) with `treated` in name</li>
</ul>
</li>
<li>"2: Factor level"
<ul>
<li>"Specify a factor level" to `untreated`</li>
<li>"Counts file(s)" to the 4 gene count files (multiple datasets) with `_untreat_` in their name</li>
</ul>
</li>
</ul>
</li>
</ul>
position: right
- title: Determines differentially expressed features 2/2
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>Click on "Insert Factor" (not on "Insert Factor level")</li>
<li>For "2: Factor"
<ul>
<li>"Specify a factor name" to `Sequencing`</li>
<li>"1: Factor level"
<ul>
<li>"Specify a factor level" to `PE`</li>
<li>"Counts file(s)" to the generated count files (multiple datasets) with `paired` in name</li>
</ul>
</li>
<li>"2: Factor level"
<ul>
<li>"Specify a factor level" to `SE`</li>
<li>"Counts file(s)" to the generated count files (multiple datasets) with `single` in name</li>
</ul>
</li>
</ul>
</li>
<li>"Output normalized counts table" to `Yes`</li>
</ul>
position: right
- title: Determines differentially expressed features
element: '.history-right-panel .list-items > *:first'
content: >-
<b>DESeq2</b> generated 3 outputs:
<ul>
<li>A table with the normalized counts for each gene (rows) and each sample (columns)</li>
<li>A graphical summary of the results, useful to evaluate the quality of the experiment:<ul>
<li>Histogram of <i>p</i>-values for all tests</li>
<li><a href="https://en.wikipedia.org/wiki/MA_plot">MA plot</a>: global view of the relationship between the expression change between conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted <i>p</i>-value &lt; 0.1) are colored in red.</li>
<li>Principal Component Analysis (<a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>) plot of the samples on the first two axes</li></ul></li>
</ul>
Each replicate is plotted as an individual data point. This type of
plot is useful for visualizing the overall effect of experimental
covariates and batch effects.
position: left
- title: Questions
content: |-
<ul>
<li>What is the first axis separating?</li>
<li>And the second axis?</li>
</ul>
backdrop: false
- title: Determines differentially expressed features
element: '.history-right-panel .list-items > *:first'
content: >-
<ul>
<li>Heatmap of the sample-to-sample distance matrix: overview of the similarities and dissimilarities between samples</li>
<li>Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue)</li>
</ul>
This dispersion plot is typical, with the final estimates shrunk
from the gene-wise estimates towards the fitted estimates. Some gene-wise
estimates are flagged as outliers and not shrunk towards the fitted value.
The amount of shrinkage can be more or less than seen here, depending on
the sample size, the number of coefficients, the row mean and the
variability of the gene-wise estimates.
position: left
- title: Questions
content: |-
<ul>
<li>How are the samples grouped?</li>
</ul>
backdrop: false
- title: Determines differentially expressed features
element: '.history-right-panel .list-items > *:first'
content: |-
A summary file with the following values for each gene:
<ul>
<li>Gene identifiers</li>
<li>Mean normalized counts, averaged over all samples from both conditions</li>
<li>Logarithm (base 2) of the fold change</li>
<li>Standard error estimate for the log2 fold change estimate</li>
<li><a href="https://en.wikipedia.org/wiki/Wald_test">Wald</a> statistic</li>
<li><i>p</i>-value for the statistical significance of this change</li>
<li><i>p</i>-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate (<a href="https://en.wikipedia.org/wiki/False_discovery_rate">FDR</a>)</li>
</ul>
position: left
- title: Visualization of the differentially expressed genes
content: >-
We would now like to draw a heatmap of the normalized counts for each
sample for the most differentially expressed genes.
We will proceed in several steps:
<ul>
<li>Extract the most differentially expressed genes using the DESeq2 summary file</li>
<li>Extract the normalized counts of these genes for each sample using the normalized count file generated by DESeq2</li>
<li>Plot the heatmap of the normalized counts of these genes for each sample</li>
</ul>
backdrop: true
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Filter' tool.
placement: right
textinsert: Filter
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Filter' tool to open it.
placement: right
postclick:
- 'a[href$="/tool_runner?tool_id=Filter1"] .tool-old-link'
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>"Filter" to the DESeq2 summary file</li>
<li>"With following condition" to `c7<0.05`</li>
</ul>
position: right
- title: Questions
content: |-
<ul>
<li>How many genes have a significant change in gene expression between these conditions?</li>
</ul>
backdrop: false
- title: Extract the most differentially expressed genes
content: >-
The generated file contains too many genes to get a meaningful heatmap, so
we will keep only the genes with an absolute fold change > 2 (i.e. an absolute log2 fold change > 1).
backdrop: true
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Filter' tool.
placement: right
textinsert: Filter
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Filter' tool to open it.
placement: right
postclick:
- 'a[href$="/tool_runner?tool_id=Filter1"] .tool-old-link'
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>"Filter" to the differentially expressed genes</li>
<li>"With following condition" to `abs(c3)>1`</li>
</ul>
position: right
- title: Questions
content: |-
<ul>
<li>How many genes have been kept?</li>
</ul>
backdrop: false
- title: Extract the most differentially expressed genes
element: '.history-right-panel .list-items > *:first'
content: >-
The number of genes is still too high, so we will keep only the 10
most up-regulated and the 10 most down-regulated genes.
position: left
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Sort' tool.
placement: right
textinsert: Sort
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Sort' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_sort_header_tool%2F1.1.1"]
.tool-old-link
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Sort Dataset" to the differentially expressed genes with abs(FC) > 2</li>
<li>"on column" to `3`</li>
<li>"with flavor" to `Numerical sort`</li>
<li>"everything in" to `Descending order`</li>
</ul>
position: right
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Select first lines' tool.
placement: right
textinsert: Select first lines
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Select first lines' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_head_tool%2F1.1.0"]
.tool-old-link
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>"File to select" to the sorted DE genes with abs(FC) > 2</li>
<li>"Operation" to `Keep first lines`</li>
<li>"Number of lines" to `10`</li>
</ul>
position: right
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Select last lines' tool.
placement: right
textinsert: Select last lines
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Select last lines' tool to open it.
placement: right
postclick:
- 'a[href$="/tool_runner?tool_id=Show+tail1"] .tool-old-link'
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: |-
Run the tool with the following parameters:
<ul>
<li>"Text file" to the sorted DE genes with abs(FC) > 2</li>
<li>"Operation" to `Keep last lines`</li>
<li>"Number of lines" to `10`</li>
</ul>
position: right
- title: Extract the most differentially expressed genes
element: '#tool-search-query'
content: Search for the 'Concatenate datasets' tool.
placement: right
textinsert: Concatenate datasets
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: Click on the 'Concatenate datasets' tool to open it.
placement: right
postclick:
- >-
a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cat%2F0.1.0"]
.tool-old-link
- title: Extract the most differentially expressed genes
element: '#tool-search'
content: >-
Run the tool with the following parameters:
<ul>
<li>"Datasets to concatenate" to the 10 most up-regulated genes and to the
10 most down-regulated genes</li>
</ul>
position: right
- title: Extract the most differentially expressed genes
content: >-
We now have a table with 20 lines corresponding to the most differentially
expressed genes. For each gene, we have its ID, its mean
normalized counts (averaged over all samples from both conditions), its
log2FC and other information.<br><br>
We could plot the log2FC for the different genes, but here we would like
to look at the heatmap with the read counts for these genes in the
different samples. So we need to extract the read counts for these
genes.<br><br>
We will join the normalized count table generated by DESeq2 with the table
we just generated, to keep only the lines of the normalized count table
that correspond to the most differentially expressed genes.
backdrop: true
- title: >-
Extract the normalized counts of the most differentially expressed genes in
the different samples
element: '#tool-search-query'
content: Search for the 'Join two Datasets' tool.
placement: right
textinsert: Join two Datasets
- title: >-
Extract the normalized counts of the most differentially expressed genes in
the different samples
element: '#tool-search'
content: Click on the 'Join two Datasets' tool to open it.
placement: right
postclick: