# ref_based_rna-seq.yaml

---
id: ref-based
name: Reference-based RNA-Seq data analysis
description: >-
  In this tutorial we will align RNA-seq reads against the reference genome of
  Drosophila melanogaster to significantly improve the ability to reconstruct
  transcripts, and then identify differences in expression between several
  conditions.
title_default: "Reference-based RNA-Seq data analysis"
tags:
  - "RNA"

steps:
  - title: Introduction
    content: >-
      In this tutorial we will align RNA-seq reads against the reference genome
      of Drosophila melanogaster to significantly improve the ability to
      reconstruct transcripts, and then identify differences in expression
      between several conditions.
    backdrop: true
  - title: Introduction
    content: >-
      In the study of <a
      href="http://genome.cshlp.org/content/21/2/193.long">Brooks et al.
      2011</a>, the <i>Pasilla (PS)</i> gene, the <i>Drosophila</i> homologue of
      the human splicing regulators Nova-1 and Nova-2, was depleted in
      <i>Drosophila melanogaster</i> by RNAi. The authors wanted to identify
      exons that are regulated by the <i>Pasilla</i> gene using RNA sequencing
      data.
    backdrop: true
  - title: Introduction
    content: >-
      Total RNA was isolated and used for preparing either single-end or
      paired-end RNA-seq libraries for treated (PS depleted) samples and
      untreated samples. These libraries were sequenced to obtain a collection
      of RNA sequencing reads for each sample. The effects of <i>Pasilla</i>
      gene depletion on splicing events can then be analyzed by comparison of
      RNA sequencing data of the treated (PS depleted) and the untreated
      samples. <br><br>The genome of <i>Drosophila melanogaster</i> is known and
      assembled. It can be used as a reference genome to ease this analysis. In a
      reference-based RNA-seq data analysis, the reads are aligned (or mapped)
      against a reference genome, <i>Drosophila melanogaster</i> here, to
      significantly improve the ability to reconstruct transcripts and then
      identify differences in expression between several conditions.
    backdrop: true
  - title: Data upload
    content: >-
      The original data is available at NCBI Gene Expression Omnibus (GEO) under
      accession number <a
      href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18508">GSE18508</a>.
      <br><br>We will look at the first 7 samples:
      <ul>
        <li>3 treated samples with <i>Pasilla</i> (PS) gene depletion: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461179">GSM461179</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461180">GSM461180</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461181">GSM461181</a></li>
        <li>4 untreated samples: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461176">GSM461176</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461177">GSM461177</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461178">GSM461178</a>, <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM461182">GSM461182</a></li>
      </ul>
      <br>Each sample constitutes a separate biological replicate of the
      corresponding condition (treated or untreated). Moreover, two of the
      treated and two of the untreated samples are from a paired-end sequencing
      assay, while the remaining samples are from a single-end sequencing
      experiment.<br><br> We have extracted sequences from the Sequence Read
      Archive (SRA) files to build FASTQ files.
    backdrop: true
  - title: History options
    element: '#history-options-button'
    content: >-
      Create a new history for this RNA-seq exercise. Click on this button and
      then "Create New"
    placement: left
  - title: Importing data via links
    content: >-
      Import files from <a href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a>.
    backdrop: true
  - title: Uploading the new data
    element: '#tool-panel-upload-button .fa.fa-upload'
    content: We need to upload data. Open the Galaxy Upload Manager
    placement: right
    postclick:
      - '#tool-panel-upload-button .fa.fa-upload'
      - '#btn-reset'
  - title: Uploading the input data
    element: '#btn-new'
    content: Click on Paste/Fetch Data
    placement: right
    postclick:
      - '#btn-new'
  - title: Uploading the input data
    element: .upload-text-column .upload-text .upload-text-content.form-control
    content: Load the data into your history by providing the links
    placement: right
    textinsert: |-
      https://zenodo.org/record/1185122/files/GSM461177_1.fastqsanger
      https://zenodo.org/record/1185122/files/GSM461177_2.fastqsanger
      https://zenodo.org/record/1185122/files/GSM461180_1.fastqsanger
      https://zenodo.org/record/1185122/files/GSM461180_2.fastqsanger
    backdrop: false
  - title: Uploading the input data
    element: '#btn-start'
    content: Click on "Start" to start loading the data to history
    placement: right
    postclick:
      - '#btn-start'
  - title: Uploading the input data
    element: '#btn-close'
    content: >-
      The upload may take a while.<br> Hit the close button to close this
      window.
    placement: right
    postclick:
      - '#btn-close'
  - title: Rename the input data
    element: '.history-right-panel .list-items > *:first'
    content: >-
      The uploaded datasets are in the history, but their names correspond to
      the links. We want to rename them to something more meaningful.<br><br>
      <ul>
        <li>Click on the pencil icon beside the file to "Edit Attributes".</li>
        <li>Change the "<b>Name:</b>" accordingly.</li>
        <li>Make sure "<b>datatype</b>" is set to "fastqsanger"</li>
      </ul>
    position: left
  - title: Adding a tag
    element: '.history-right-panel .list-items > *:first'
    content: >-
      Add to each dataset a tag corresponding to the name of the sample
      (`#GSM461177` or `#GSM461180`):
      <ul>
        <li>Click on the dataset</li>
        <li>Click on <b>Edit dataset tags</b></li>
        <li>Add the tag starting with `#`</li>
      </ul>
    position: left
  - title: Quality control
    content: >-
      The sequences are raw data from the sequencing machine, without any
      pretreatment. Their quality needs to be assessed.<br><br>
      For quality control, we use tools similar to those described in the <a
      href="http://galaxyproject.github.io/training-material/topics/sequence-analysis">NGS-QC
      tutorial</a>: <a
      href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a>
      and <a
      href="https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/">Trim
      Galore</a>.
    backdrop: true
  - title: Quality control
    element: '#tool-search-query'
    content: Search for 'FastQC' tool.
    placement: right
    textinsert: FastQC
  - title: Quality control
    element: '#tool-search'
    content: Click on the 'FastQC' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fdevteam%2Ffastqc%2Ffastqc%2F0.69"]
        .tool-old-link
  - title: Quality control
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Short read data from your current history" to `Multiple datasets`</li>
      </ul>
    position: right
  - title: Quality control
    element: '.history-right-panel .list-items > *:first'
    content: Inspect the generated webpage for the GSM461177_1 sample.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>What is the read length?</li>
      </ul>
    backdrop: false
  - title: Quality control
    element: '#tool-search-query'
    content: Search for 'MultiQC' tool.
    placement: right
    textinsert: MultiQC
  - title: Quality control
    element: '#tool-search'
    content: Click on the 'MultiQC' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
  - title: Quality control
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Which tool was used generate logs?" to `FastQC`</li>
        <li>"Type of FastQC output?" to `Raw data`</li>
        <li>"FastQC output" to the generated Raw data files (multiple datasets)</li>
      </ul>
    position: right
  - title: Quality control
    element: '.history-right-panel .list-items > *:first'
    content: Inspect the webpage output from MultiQC
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>What is the quality of the sequences in the different files?</li>
      </ul>
    backdrop: false
  - title: Quality control
    element: '#tool-search-query'
    content: Search for 'Trim Galore' tool.
    placement: right
    textinsert: Trim Galore
  - title: Quality control
    element: '#tool-search'
    content: Click on the 'Trim Galore' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftrim_galore%2Ftrim_galore%2F0.4.3.1"]
        .tool-old-link
  - title: Quality control
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Is this library paired- or single-end?" to `Paired-end`</li>
        <li>First "Reads in FASTQ format" to both `_1` fastqsanger datasets (multiple datasets)</li>
        <li>Second "Reads in FASTQ format" to both `_2` fastqsanger datasets (multiple datasets)</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>Why do we run Trim Galore! only once on a paired-end dataset and not twice, once for each dataset?</li>
      </ul>
    backdrop: false
  - title: Mapping
    content: >-
      As the genome of <i>Drosophila melanogaster</i> is known and assembled, we
      can use this information and map the sequences on this genome to identify
      the effects of <i>Pasilla</i> gene depletion on splicing events.<br><br>

      To make sense of the reads, we need to determine to which genes
      they belong. The first step is to determine their positions within the
      <i>Drosophila melanogaster</i> genome. This process is known as aligning
      or ‘mapping’ the reads to a reference.<br><br>

      In a eukaryotic transcriptome, most reads originate from processed mRNAs
      lacking introns, so they cannot simply be mapped back to the genome as
      we normally do for DNA data. Instead, the reads must be separated into
      two categories:
      <ul>
        <li>Reads that map entirely within exons</li>
        <li>Reads that cannot be mapped within an exon across their entire length because they span two or more exons</li>
      </ul>
    backdrop: true
  - title: Mapping
    element: '#tool-search-query'
    content: Search for 'RNA STAR' tool.
    placement: right
    textinsert: RNA STAR
  - title: Mapping
    element: '#tool-search'
    content: Click on the 'RNA STAR' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Frgrnastar%2Frna_star%2F2.5.2b-0"]
        .tool-old-link
  - title: Mapping
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Single-end or paired-end reads" to `Paired-end (as individual datasets)`</li>
        <li>"RNA-Seq FASTQ/FASTA file, forward reads" to the generated `trimmed reads pair 1` files (multiple datasets)</li>
        <li>"RNA-Seq FASTQ/FASTA file, reverse reads" to the generated `trimmed reads pair 2` files (multiple datasets)</li>
        <li>"Custom or built-in reference genome" to `Use a built-in index`</li>
        <li>"Reference genome with or without an annotation" to `use genome reference without builtin gene-model`</li>
        <li>"Select reference genome" to `Drosophila Melanogaster (dm6)`</li>
        <li>"Gene model (gff3,gtf) file for splice junctions" to the imported `Drosophila_melanogaster.BDGP6.87.gtf`</li>
        <li>"Length of the genomic sequence around annotated junctions" to `36`</li>
      </ul>
    position: right
  - title: Mapping
    element: '#tool-search-query'
    content: Search for 'MultiQC' tool.
    placement: right
    textinsert: MultiQC
  - title: Mapping
    element: '#tool-search'
    content: Click on the 'MultiQC' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
        .tool-old-link
  - title: Mapping
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Which tool was used generate logs?" to `STAR`</li>
        <li>"Type of STAR output?" to `Log`</li>
        <li>"STAR log output" to the generated `log` files (multiple datasets)</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>What percentage of reads was mapped exactly once in each of the two samples?</li>
        <li>What is a BAM file?</li>
        <li>What does such a file contain?</li>
      </ul>
    backdrop: false
  - title: Inspection of the mapping results with IGV
    content: >-
      The BAM file contains information about where the reads are mapped on the
      reference genome. However, it is a binary file that encodes information
      for more than 3 million reads, so it is difficult to inspect and explore
      directly.

      <br><br>A powerful tool to visualize the content of BAM files is the
      Integrative Genomics Viewer (IGV).
    backdrop: true
  - title: Inspection of the mapping results with IGV
    element: '.history-right-panel .list-items > *:first'
    content: |-
      Visualize the aligned reads for `GSM461177`:
      <ul>
        <li>Click on the STAR BAM output in your history to expand it.</li>
        <li>Towards the bottom of the history item, find the line starting with `Display with IGV`</li>
      </ul>
    position: left
  - title: Inspection of the mapping results with IGV
    content: 'Zoom to `chr4:540,000-560,000` (Chromosome 4 between 540 kb and 560 kb)'
    backdrop: false
  - title: Questions
    content: |-
      <ul>
        <li>Which information appears at the top in grey?</li>
        <li>What do the connecting lines between some of the aligned reads indicate?</li>
      </ul>
    backdrop: false
  - title: Creation of a Sashimi plot
    content: |-
      <ul>
        <li>Right click on the BAM file</li>
        <li>Select Sashimi Plot from the context menu</li>
      </ul>
    backdrop: false
  - title: Questions
    content: |-
      <ul>
        <li>What does the vertical bar graph represent? And the numbered arcs?</li>
        <li>What do the numbers on the arcs mean?</li>
        <li>Why do we observe different stacked groups of blue linked boxes at the bottom?</li>
      </ul>
    backdrop: false
  - title: Aftermath
    content: >-
      After the mapping, we have the information on where the reads are located
      on the reference genome. We also know how well they were mapped.<br><br>

      The next step in the RNA-Seq data analysis is to quantify the expression
      level of the genomic features (genes, transcripts, exons, …) so that
      several samples can then be compared in the differential expression
      analysis. Quantification consists of taking each known genomic feature
      (e.g. a gene) of the reference genome and counting how many reads are
      mapped to it. In this step, we therefore start with information per
      mapped read and end with information per genomic feature.

      <br><br>To identify exons that are regulated by the <i>Pasilla</i> gene, we
      need to identify genes and exons which are differentially expressed
      between samples with PS gene depletion and control samples. In this
      tutorial, we will therefore analyze differential gene expression as well
      as differential exon usage.
    backdrop: true
  - title: Aftermath
    content: >-
      We will first investigate the differential gene expression to identify
      which genes are impacted by the <i>Pasilla</i> gene depletion.

      <br><br>To compare the expression of single genes between different
      conditions (e.g. with or without PS depletion), an essential first step is
      to quantify the number of reads per gene.<br><br>

      Two main tools could be used for that: <a
      href='http://htseq.readthedocs.io/en/release_0.9.1/count.html'>HTSeq-count</a>
      (<a
      href='https://academic.oup.com/bioinformatics/article/31/2/166/2366196'>Anders
      et al., Bioinformatics, 2015</a>) or featureCounts (<a
      href='https://doi.org/10.1093/bioinformatics/btt656'>Liao
      et al., Bioinformatics, 2014</a>). The latter is considerably faster
      and requires far fewer computational resources, so we will use it.
    backdrop: true
  - title: Estimation of the strandedness
    content: >-
      RNAs that are typically targeted in RNAseq experiments are single stranded
      (e.g., mRNAs) and thus have polarity (5’ and 3’ ends that are functionally
      distinct).

      <br><br>During a typical RNAseq experiment, the information about
      strandedness is lost after both strands of cDNA are synthesized, size
      selected, and converted into a sequencing library. However, this
      information can be quite useful for read counting.<br><br>

      Some library preparation protocols create so-called <i>stranded</i> RNAseq
      libraries that preserve the strand information (see the excellent overview
      in <a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3005310/'>Levin et
      al., Nat Meth, 2010</a>). The implication of stranded RNAseq is that you
      can distinguish whether the reads are derived from forward- or
      reverse-encoded transcripts. Depending on the approach, and on whether one
      performs single- or paired-end sequencing, there are multiple
      possibilities for interpreting the results of mapping these reads onto
      the genome.
    backdrop: true
  - title: Estimation of the strandedness
    content: >-
      In practice, with Illumina paired-end RNAseq protocols, you are unlikely
      to encounter many of these possibilities. You will deal with either:
      <ul>
        <li>Unstranded RNAseq data</li>
        <li>Stranded RNAseq data produced with Illumina TruSeq RNAseq kits and <a href='https://nar.oxfordjournals.org/content/37/18/e123'>dUTP tagging</a> (<b>ISR</b>)</li>
      </ul>
      This information should usually come with your FASTQ files; ask your
      sequencing facility! If not, try to find it on the site where you
      downloaded the data or in the corresponding publication.<br>

      Another option is to estimate these parameters with a tool called <b>Infer
      Experiment</b>. This tool takes the output of your mappings (BAM files),
      takes a subsample of your reads and compares their genome coordinates and
      strands with those of the reference gene model (from an annotation file).
      Based on the strand of the genes, it can gauge whether sequencing is
      strand-specific, and if so, how reads are stranded.
    backdrop: true
  - title: Determining the library strandedness
    element: '#tool-search-query'
    content: Search for 'Infer Experiment' tool.
    placement: right
    textinsert: Infer Experiment
  - title: Determining the library strandedness
    element: '#tool-search'
    content: Click on the 'Infer Experiment' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fnilesh%2Frseqc%2Frseqc_infer_experiment%2F2.6.4"]
        .tool-old-link
  - title: Determining the library strandedness
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Input .bam file" to the STAR-generated `BAM` files (multiple
      datasets)</li>
        <li>"Reference gene model" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
        <li>"Number of reads sampled from SAM/BAM file (default = 200000)" to `200000`</li>
      </ul>
    position: right
  - title: The output
    element: '.history-right-panel .list-items > *:first'
    content: >-
      The tool generates one file with:
      <ul>
        <li>Paired-end or single-end library</li>
        <li>Fraction of reads failed to determine</li>
        <li>2 lines
          <ul>
            <li>For single-end
              <ul>
                <li>Fraction of reads explained by "++,--"</li>
                <li>Fraction of reads explained by "+-,-+"</li>
              </ul>
            </li>
            <li>For paired-end
              <ul>
                <li>Fraction of reads explained by "1++,1--,2+-,2-+"</li>
                <li>Fraction of reads explained by "1+-,1-+,2++,2--"</li>
              </ul>
            </li>
          </ul>
        </li>
      </ul>
      If the fractions in the last two lines are close to each other, we
      conclude that the library is not strand-specific (U in the previous
      figure).
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>Which fraction of the reads in the BAM file can be explained assuming which library type for `GSM461177`?</li>
        <li>Which library type do you choose for both samples?</li>
      </ul>
    backdrop: false
  - title: Counting
    content: >-
      We now run <b>featureCounts</b> to count the number of reads per annotated
      gene.
    backdrop: true
  - title: Counting the number of reads per annotated gene
    element: '#tool-search-query'
    content: Search for 'featureCounts' tool.
    placement: right
    textinsert: featureCounts
  - title: Counting the number of reads per annotated gene
    element: '#tool-search'
    content: Click on the 'featureCounts' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffeaturecounts%2Ffeaturecounts%2F1.6.0.3"]
        .tool-old-link
  - title: Counting the number of reads per annotated gene
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Alignment file" to the STAR-generated `BAM` files (multiple datasets)</li>
        <li>"Gene annotation file" to `GTF file`</li>
        <li>"Gene annotation file" to `in your history`</li>
        <li>"Gene annotation file" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
        <li>"Output format" to `Gene-ID "\t" read-count (DESeq2 IUC wrapper compatible)`</li>
        <li>Click on "Advanced options"</li>
        <li>"GFF feature type filter" to `exon`</li>
        <li>"GFF gene identifier" to `gene_id`</li>
        <li>"Allow read to contribute to multiple features" to `No`</li>
        <li>"Strand specificity of the protocol" to `Unstranded`</li>
        <li>"Count multi-mapping reads/fragments" to `Disabled; multi-mapping reads are excluded (default)`</li>
        <li>"Minimum mapping quality per read" to `10`</li>
      </ul>
    position: right
  - title: Counting the number of reads per annotated gene
    element: '#tool-search-query'
    content: Search for 'MultiQC' tool.
    placement: right
    textinsert: MultiQC
  - title: Counting the number of reads per annotated gene
    element: '#tool-search'
    content: Click on the 'MultiQC' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fmultiqc%2Fmultiqc%2F1.3.1"]
        .tool-old-link
  - title: Counting the number of reads per annotated gene
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Which tool was used generate logs?" to `featureCounts`</li>
        <li>"Output of FeatureCounts" to the generated `summary` files (multiple
      datasets)</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>How many reads have been assigned to a gene?</li>
      </ul>
    backdrop: false
  - title: The output
    element: '.history-right-panel .list-items > *:first'
    content: The main output of <b>featureCounts</b> is a big table.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>What information do the generated table files contain?</li>
        <li>Which feature has the most reads mapped on it for both samples?</li>
      </ul>
    backdrop: false
  - title: Identification of the differentially expressed features
    content: >-
      So far we counted reads that mapped to genes for two samples. To be able to
      identify differential gene expression induced by PS depletion, all
      datasets (3 treated and 4 untreated) must be analyzed following the same
      procedure and for the whole genome.
    backdrop: true
  - title: Identification of the differentially expressed features
    content: >-
      To save time, we have run the necessary steps for you and obtained 7 count
      files, available on <a
      href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a>.

      <br><br>These files contain for each gene of Drosophila the number of
      reads mapped to it. We could compare the files directly and calculate the
      extent of differential gene expression, but the number of sequenced reads
      mapped to a gene depends on:
      <ul>
        <li>Its own expression level</li>
        <li>Its length</li>
        <li>The sequencing depth of the sample</li>
        <li>The expression of all other genes within the sample</li>
      </ul>
    backdrop: true
  - title: Identification of the differentially expressed features
    content: >-
      Either for within- or for between-sample comparison, the gene counts need
      to be normalized. We can then use the Differential Gene Expression (DGE)
      analysis, whose two basic tasks are:
      <ul>
        <li>Estimate the biological variance using the replicates for each condition</li>
        <li>Estimate the significance of expression differences between any two conditions</li>
      </ul>
      The analysis is estimated from read counts, and replicates are used to
      correct for variability in the measurements; they are absolutely
      essential for accurate results. For your own analysis, we advise you to
      use at least 3, but preferably 5, biological replicates per condition.
      The number of replicates may differ between conditions.
    backdrop: true
  - title: Identification of the differentially expressed features
    content: >-
      <a
      href="https://bioconductor.org/packages/release/bioc/html/DESeq2.html">DESeq2</a>
      is a great tool for DGE analysis. It takes the read counts produced
      previously, combines them into a big table (with genes in the rows and
      samples in the columns) and applies size factor normalization:
      <ul>
        <li>Computation for each gene of the geometric mean of read counts across all samples</li>
        <li>Division of every gene count by the geometric mean</li>
        <li>Use of the median of these ratios as a sample’s size factor for normalization</li>
      </ul>
      Multiple factors with several levels can then be incorporated in the
      analysis. After normalization we can compare, in a statistically reliable
      way, the response of the expression of any gene to the presence of
      different levels of a factor.
    backdrop: true
  - title: Identification of the differentially expressed features
    content: >-
      In our example, we have samples with two varying factors that can explain
      differences in gene expression:
      <ul>
        <li>Treatment (either treated or untreated)</li>
        <li>Sequencing type (paired-end or single-end)</li>
      </ul>
      Here treatment is the primary factor which we are interested in. The
      sequencing type is some further information that we know about the data
      that might affect the analysis. This particular multi-factor analysis
      allows us to assess the effect of the treatment, while taking the
      sequencing type into account, too.
    backdrop: true
  - title: Data upload
    content: >-
      Import the seven count files from <a
      href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a> or the data
      library.
    backdrop: true
  - title: Uploading the new data
    element: '#tool-panel-upload-button .fa.fa-upload'
    content: We need to upload data. Open the Galaxy Upload Manager
    placement: right
    postclick:
      - '#tool-panel-upload-button .fa.fa-upload'
      - '#btn-reset'
  - title: Uploading the input data
    element: '#btn-new'
    content: Click on Paste/Fetch Data
    placement: right
    postclick:
      - '#btn-new'
  - title: Uploading the input data
    element: .upload-text-column .upload-text .upload-text-content.form-control
    content: Load the data into your history by providing the links
    placement: right
    textinsert: |-
      https://zenodo.org/record/1185122/files/GSM461176_untreat_single.counts
      https://zenodo.org/record/1185122/files/GSM461177_untreat_paired.counts
      https://zenodo.org/record/1185122/files/GSM461178_untreat_paired.counts
      https://zenodo.org/record/1185122/files/GSM461179_treat_single.counts
      https://zenodo.org/record/1185122/files/GSM461180_treat_paired.counts
      https://zenodo.org/record/1185122/files/GSM461181_treat_paired.counts
      https://zenodo.org/record/1185122/files/GSM461182_untreat_single.counts
    backdrop: false
  - title: Uploading the input data
    element: '#btn-start'
    content: Click on "Start" to start loading the data to history
    placement: right
    postclick:
      - '#btn-start'
  - title: Uploading the input data
    element: '#btn-close'
    content: >-
      The upload may take a while.<br> Hit the close button to close this
      window.
    placement: right
    postclick:
      - '#btn-close'
  - title: Determines differentially expressed features
    element: '#tool-search-query'
    content: Search for 'DESeq2' tool.
    placement: right
    textinsert: DESeq2
  - title: Determines differentially expressed features
    element: '#tool-search'
    content: Click on the 'DESeq2' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fdeseq2%2Fdeseq2%2F2.11.40.1"]
        .tool-old-link
  - title: Determines differentially expressed features 1/2
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>For "1: Factor"
          <ul>
            <li>"Specify a factor name" to `Treatment`</li>
            <li>"1: Factor level"
              <ul>
                <li>"Specify a factor level" to `treated`</li>
                <li>"Counts file(s)" to the 3 gene count files (multiple datasets) with `_treat_` in their name</li>
              </ul>
            </li>
            <li>"2: Factor level"
              <ul>
                <li>"Specify a factor level" to `untreated`</li>
                <li>"Counts file(s)" to the 4 gene count files (multiple datasets) with `_untreat_` in their name</li>
              </ul>
            </li>
          </ul>
        </li>
      </ul>
    position: right
  - title: Determines differentially expressed features 2/2
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>Click on "Insert Factor" (not on "Insert Factor level")</li>
        <li>For "2: Factor"
          <ul>
            <li>"Specify a factor name" to `Sequencing`</li>
            <li>"1: Factor level"
              <ul>
                <li>"Specify a factor level" to `PE`</li>
                <li>"Counts file(s)" to the generated count files (multiple datasets) with `paired` in name</li>
              </ul>
            </li>
            <li>"2: Factor level"
              <ul>
                <li>"Specify a factor level" to `SE`</li>
                <li>"Counts file(s)" to the generated count files (multiple datasets) with `single` in name</li>
              </ul>
            </li>
          </ul>
        </li>
        <li>"Output normalized counts table" to `Yes`</li>
      </ul>
    position: right
  - title: Determines differentially expressed features
    element: '.history-right-panel .list-items > *:first'
    content: >-
      <b>DESeq2</b> generated 3 outputs
      <ul>
        <li>A table with the normalized counts for each gene (rows) and each sample (columns)</li>
        <li>A graphical summary of the results, useful to evaluate the quality of the experiment:<ul>
        <li>Histogram of <i>p</i>-values for all tests</li>
        <li><a href="https://en.wikipedia.org/wiki/MA_plot">MA plot</a>: global view of the relationship between the expression change of conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted p-value < 0.1) are colored in red.</li>
        <li>Principal Component Analysis (<a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>) plot of the samples along the first two axes</li></ul></li>
      </ul>
      Each replicate is plotted as an individual data point. This type of
      plot is useful for visualizing the overall effect of experimental
      covariates and batch effects.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>What is the first axis separating?</li>
        <li>And the second axis?</li>
      </ul>
    backdrop: false
  - title: Determines differentially expressed features
    element: '.history-right-panel .list-items > *:first'
    content: >-
      <ul>
        <li>Heatmap of sample-to-sample distance matrix: overview of similarities and dissimilarities between samples</li>
        <li>Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue)</li>
      </ul>
      This dispersion plot is typical, with the final estimates shrunk
      from the gene-wise estimates towards the fitted estimates. Some gene-wise
      estimates are flagged as outliers and not shrunk towards the fitted value.
      The amount of shrinkage can be more or less than seen here, depending on
      the sample size, the number of coefficients, the row mean and the
      variability of the gene-wise estimates.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>How are the samples grouped?</li>
      </ul>
    backdrop: false
  - title: Determines differentially expressed features
    element: '.history-right-panel .list-items > *:first'
    content: |-
      A summary file with the following values for each gene:
      <ul>
        <li>Gene identifiers</li>
        <li>Mean normalized counts, averaged over all samples from both conditions</li>
        <li>Logarithm (base 2) of the fold change</li>
        <li>Standard error estimate for the log2 fold change estimate</li>
        <li><a href="https://en.wikipedia.org/wiki/Wald_test">Wald</a> statistic</li>
        <li><i>p</i>-value for the statistical significance of this change</li>
        <li><i>p</i>-value adjusted for multiple testing with the Benjamini-Hochberg procedure which controls false discovery rate (<a href="https://en.wikipedia.org/wiki/False_discovery_rate">FDR</a>)</li>
      </ul>
    position: left
  - title: Visualization of the differentially expressed genes
    content: >-
      We would now like to draw a heatmap of the normalized counts in each
      sample for the most differentially expressed genes.

      We will proceed in several steps:

      <ul>
        <li>Extract the most differentially expressed genes using the DESeq2 summary file</li>
        <li>Extract the normalized counts of these genes for each sample using the normalized count file generated by DESeq2</li>
        <li>Plot the heatmap of the normalized counts of these genes for each sample</li>
      </ul>
    backdrop: true
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Filter' tool.
    placement: right
    textinsert: Filter
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Filter' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=Filter1"] .tool-old-link'
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Filter" to the DESeq2 summary file</li>
        <li>"With following condition" to `c7<0.05`</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>How many genes have a significant change in gene expression between these conditions?</li>
      </ul>
    backdrop: false
  - title: Extract the most differentially expressed genes
    content: >-
      The generated file contains too many genes to get a meaningful heatmap. So
      we will keep only the genes with an absolute fold change > 2, that is,
      an absolute log2 fold change > 1.
    backdrop: true
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Filter' tool.
    placement: right
    textinsert: Filter
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Filter' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=Filter1"] .tool-old-link'
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Filter" to the differentially expressed genes</li>
        <li>"With following condition" to `abs(c3)>1`</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>How many genes have been kept?</li>
      </ul>
    backdrop: false
  - title: Extract the most differentially expressed genes
    element: '.history-right-panel .list-items > *:first'
    content: >-
      The number of genes is still too high. So we will keep only the 10
      most up-regulated and the 10 most down-regulated genes.
    position: left
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Sort' tool.
    placement: right
    textinsert: Sort
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Sort' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_sort_header_tool%2F1.1.1"]
        .tool-old-link
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Sort Dataset" to the differentially expressed genes with abs(FC) > 2</li>
        <li>"on column" to `3`</li>
        <li>"with flavor" to `Numerical sort`</li>
        <li>"everything in" to `Descending order`</li>
      </ul>
    position: right
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Select first lines' tool.
    placement: right
    textinsert: Select first lines
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Select first lines' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_head_tool%2F1.1.0"]
        .tool-old-link
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"File to select" to the sorted DE genes with abs(FC) > 2</li>
        <li>"Operation" to `Keep first lines`</li>
        <li>"Number of lines" to `10`</li>
      </ul>
    position: right
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Select last lines' tool.
    placement: right
    textinsert: Select last lines
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Select last lines' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=Show+tail1"] .tool-old-link'
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Text file" to the sorted DE genes with abs(FC) > 2</li>
        <li>"Operation" to `Keep last lines`</li>
        <li>"Number of lines" to `10`</li>
      </ul>
    position: right
  - title: Extract the most differentially expressed genes
    element: '#tool-search-query'
    content: Search for 'Concatenate datasets' tool.
    placement: right
    textinsert: Concatenate datasets
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: Click on the 'Concatenate datasets' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cat%2F0.1.0"]
        .tool-old-link
  - title: Extract the most differentially expressed genes
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Datasets to concatenate" to the 10 most up-regulated genes and to the
        10 most down-regulated genes</li>
      </ul>
    position: right
  - title: Extract the most differentially expressed genes
    content: >-
      We now have a table with 20 lines corresponding to the most differentially
      expressed genes. For each gene, we have its ID, its mean
      normalized counts (averaged over all samples from both conditions), its
      log2FC and other information.<br><br>

      We could plot the log2FC for the different genes, but here we would like
      to look at a heatmap of the read counts for these genes in the
      different samples. So we need to extract the read counts for these
      genes.<br><br>

      We will join the normalized count table generated by DESeq2 with the table
      we just generated, to keep in the normalized count table only the
      lines corresponding to the most differentially expressed genes.
    backdrop: true
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search-query'
    content: Search for 'Join two Datasets' tool.
    placement: right
    textinsert: Join two Datasets
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search'
    content: Click on the 'Join two Datasets' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=join1"] .tool-old-link'
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Join" to the normalized count table generated by DESeq2</li>
        <li>"using column" to `1`</li>
        <li>"with" to the concatenated file with 10 most up-regulated genes and the 10 most down-regulated genes</li>
        <li>"and column" to `1`</li>
        <li>"Keep lines of first input that do not join with second input" to `No`</li>
        <li>"Keep the header lines" to `Yes`</li>
      </ul>
    position: right
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '.history-right-panel .list-items > *:first'
    content: >-
      The generated file has too many columns: those with the mean normalized
      counts, the log2FC and other information. We need to remove them.
    position: left
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search-query'
    content: Search for 'Cut' tool.
    placement: right
    textinsert: Cut
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search'
    content: Click on the 'Cut' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cut_tool%2F1.1.0"]
        .tool-old-link
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:<ul>
      <li>"Cut columns" to `c1,c2,c3,c4,c5,c6,c7,c8`</li>
      <li>"Delimited by" to `Tab`</li>
      <li>"From" the joined dataset</li>
      </ul>
    position: right
  - title: >-
      Extract the normalized counts of most differentially expressed genes in
      the different samples
    element: '.history-right-panel .list-items > *:first'
    content: >-
      We now have a table with 20 lines (the most differentially expressed
      genes) and the normalized counts for these genes in the 7 samples.
    position: left
  - title: Plot the heatmap of the normalized counts of these genes for each sample
    element: '#tool-search-query'
    content: Search for 'heatmap2' tool.
    placement: right
    textinsert: heatmap2
  - title: Plot the heatmap of the normalized counts of these genes for each sample
    element: '#tool-search'
    content: Click on the 'heatmap2' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fggplot2_heatmap2%2Fggplot2_heatmap2%2F2.2.1"]
        .tool-old-link
  - title: Plot the heatmap of the normalized counts of these genes for each sample
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Input should have column headers" to the generated table</li>
        <li>"Advanced - log transformation" to `Log2(value) transform my data`</li>
        <li>"Enable data clustering" to `Yes`</li>
        <li>"Coloring groups" to `Blue to white to red`</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>Do you observe any tendency in the data?</li>
        <li>What is changing if we select `Plot the data as it is` in "Advanced - log transformation"?</li>
        <li>Can you generate a heatmap of the normalized counts for the up-regulated genes with FC > 2?</li>
      </ul>
    backdrop: false
  - title: >-
      Analysis of the functional enrichment among the differentially expressed
      genes
    content: >-
      We have extracted genes that are differentially expressed in treated (with
      PS gene depletion) samples compared to untreated samples. We would like to
      know the functional enrichment among the differentially expressed
      genes.<br><br>

      <a href="http://www.geneontology.org/">Gene Ontology (GO)</a> analysis is
      widely used to reduce complexity and highlight biological processes in
      genome-wide expression studies, but standard methods give biased results
      on RNA-seq data due to over-detection of differential expression for long
      and highly expressed transcripts.

      <br><br><a
      href="https://bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf">goseq
      tool</a> provides methods for performing GO analysis of RNA-seq data,
      taking length bias into account. The methods and software used by goseq
      are equally applicable to other category based tests of RNA-seq data, such
      as KEGG pathway analysis.
    backdrop: true
  - title: >-
      Analysis of the functional enrichment among the differentially expressed
      genes
    content: |-
      goseq needs 2 files as inputs:
      <ul>
        <li>A tabular file with the differentially expressed genes from all genes assayed in the RNA-seq experiment with 2 columns:<ul>
        <li>the Gene IDs (unique within the file)</li>
         <li>True (differentially expressed) or False (not differentially expressed)</li></ul></li>
        <li>A file with information about the length of a gene to correct for potential length bias in differentially expressed genes</li>
      </ul>
    backdrop: true
  - title: Prepare the datasets for GOSeq
    element: '#tool-search-query'
    content: Search for 'Compute' tool.
    placement: right
    textinsert: Compute
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: Click on the 'Compute' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fdevteam%2Fcolumn_maker%2FAdd_a_column1%2F1.1.0"]
        .tool-old-link
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Add expression" to `bool(c7<0.05)`</li>
        <li>"as a new column to" to the DESeq2 summary file</li>
      </ul>
    position: right
  - title: Prepare the datasets for GOSeq
    element: '#tool-search-query'
    content: Search for 'Cut' tool.
    placement: right
    textinsert: Cut
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: Click on the 'Cut' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Ftext_processing%2Ftp_cut_tool%2F1.1.0"]
        .tool-old-link
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Cut columns" to `c1,c8`</li>
        <li>"Delimited by" to `Tab`</li>
        <li>"From" to the previously generated file</li>
      </ul>
    position: right
  - title: Prepare the datasets for GOSeq
    element: '#tool-search-query'
    content: Search for 'Change Case' tool.
    placement: right
    textinsert: Change Case
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: Click on the 'Change Case' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=ChangeCase"] .tool-old-link'
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"From" to the previously generated file</li>
        <li>"Change case of columns" to `c1`</li>
        <li>"Delimited by" to `Tab`</li>
        <li>"To" to `Upper case`</li>
      </ul>
    position: right
  - title: Prepare the datasets for GOSeq
    element: '#tool-search-query'
    content: Search for 'Gene length and GC content' tool.
    placement: right
    textinsert: Gene length and GC content
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: Click on the 'Gene length and GC content' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Flength_and_gc_content%2Flength_and_gc_content%2F0.1.1"]
        .tool-old-link
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Select a built-in GTF file or one from your history" to `Use a GTF from history`</li>
        <li>"Select a GTF file" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
        <li>"Select a built-in FASTA or one from your history" to `Use a built-in FASTA`</li>
        <li>"Select a FASTA file" to `Fly (Drosophila melanogaster): dm6 Full`</li>
        <li>"Output length file?" to `Yes`</li>
        <li>"Output GC content file?" to `No`</li>
      </ul>
    position: right
  - title: Prepare the datasets for GOSeq
    element: '#tool-search-query'
    content: Search for 'Change Case' tool.
    placement: right
    textinsert: Change Case
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: Click on the 'Change Case' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=ChangeCase"] .tool-old-link'
  - title: Prepare the datasets for GOSeq
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"From" to the previously generated file</li>
        <li>"Change case of columns" to `c1`</li>
        <li>"Delimited by" to `Tab`</li>
        <li>"To" to `Upper case`</li>
      </ul>
    position: right
  - title: >-
      Analysis of the functional enrichment among the differentially expressed
      genes
    content: We now have the two required files for goseq.
    backdrop: true
  - title: Perform GO analysis
    element: '#tool-search-query'
    content: Search for 'goseq' tool.
    placement: right
    textinsert: goseq
  - title: Perform GO analysis
    element: '#tool-search'
    content: Click on the 'goseq' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fgoseq%2Fgoseq%2F1.26.0"]
        .tool-old-link
  - title: Perform GO analysis
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Differentially expressed genes file" to first file generated on previous step</li>
        <li>"Gene lengths file" to second file generated on previous step</li>
        <li>"Gene categories" to `Get categories`</li>
        <li>"Select a genome to use" to `Fruit fly (dm6)`</li>
        <li>"Select Gene ID format" to `Ensembl Gene ID`</li>
        <li>"Select one or more categories" to `GO: Cellular Component`, `GO: Biological Process`, `GO: Molecular Function`</li>
      </ul>
    position: right
  - title: Perform GO analysis
    element: '.history-right-panel .list-items > *:first'
    content: >-
      goseq generates a big table with the following columns for each GO
      term:
      <ul>
        <li>category</li>
        <li>over_rep_pval</li>
        <li>under_rep_pval</li>
        <li>numDEInCat</li>
        <li>numInCat</li>
        <li>term</li>
        <li>ontology</li>
        <li>p.adjust.over_represented</li>
        <li>p.adjust.under_represented</li>
      </ul>
      To identify categories significantly over- or under-represented below some
      <i>p</i>-value cutoff, it is necessary to use the adjusted <i>p</i>-value.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>How many GO terms are over represented? Under represented?</li>
        <li>How are the over represented GO terms divided between MF, CC and BP? And for under represented GO terms?</li>
      </ul>
    backdrop: false
  - title: Inference of the differential exon usage
    content: >-
      Next, we would like to know the differential exon usage between treated
      (PS depleted) and untreated samples using RNA-seq exon counts. We will
      reuse the mapping results we generated previously.<br><br>

      We will use <a
      href="https://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html">DEXSeq</a>.
      DEXSeq detects with high sensitivity genes, and in many cases exons, that are
      subject to differential exon usage. But first, as for the differential
      gene expression, we need to count the number of reads mapping to the
      exons.
    backdrop: true
  - title: Counting the number of reads per exon
    element: '#tool-search-query'
    content: Search for 'DEXSeq-Count' tool.
    placement: right
    textinsert: DEXSeq-Count
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: Click on the 'DEXSeq-Count' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fdexseq%2Fdexseq_count%2F1.24.0.0"]
        .tool-old-link
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Mode of operation" to `Prepare annotation`</li>
        <li>"GTF file" to `Drosophila_melanogaster.BDGP6.87.gtf`</li>
      </ul>
    position: right
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: Click on the 'DEXSeq-Count' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fdexseq%2Fdexseq_count%2F1.24.0.0"]
        .tool-old-link
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"Mode of operation" to `Count reads`</li>
        <li>"Input bam file" to the STAR-generated `BAM` files (multiple datasets)</li>
        <li>"DEXSeq compatible GTF file" to the previously generated GTF file</li>
        <li>"Is library paired end?" to `Yes`</li>
        <li>"Is library strand specific?" to `No`</li>
        <li>"Skip all reads with alignment quality lower than the given minimum value" to `10`</li>
      </ul>
    position: right
  - title: Counting the number of reads per exon
    element: '.history-right-panel .list-items > *:first'
    content: >-
      DEXSeq generates a file similar to the one generated by featureCounts, but
      with counts for exons.
    position: left
  - title: Questions
    content: |-
      <ul>
        <li>Which exon has the most reads mapped to it for both samples?</li>
        <li>To which gene does this exon belong?</li>
        <li>Is there a connection to the previous result obtained with featureCounts?</li>
      </ul>
    backdrop: false
  - title: Differential exon usage
    content: >-
      DEXSeq usage is similar to DESeq2. It uses similar statistics to find
      differentially used exons.<br><br>

      As for DESeq2, in the previous step, we counted only reads that mapped to
      exons on chromosome 4 and for only one sample. To be able to identify
      differential exon usage induced by PS depletion, all datasets (3 treated
      and 4 untreated) must be analyzed following the same procedure. To save
      time, we did that for you. The results are available on <a
      href="https://dx.doi.org/10.5281/zenodo.1185122">Zenodo</a>:<ul>
        <li>The results of running DEXSeq-count in ‘Prepare annotation’ mode</li>
        <li>Seven count files generated in ‘Count reads’ mode</li></ul>
    backdrop: true
  - title: Uploading the new data
    element: '#tool-panel-upload-button .fa.fa-upload'
    content: We need to upload data. Open the Galaxy Upload Manager
    placement: right
    postclick:
      - '#tool-panel-upload-button .fa.fa-upload'
      - '#btn-reset'
  - title: Uploading the input data
    element: '#btn-new'
    content: Click on Paste/Fetch Data
    placement: right
    postclick:
      - '#btn-new'
  - title: Uploading the input data
    element: .upload-text-column .upload-text .upload-text-content.form-control
    content: Load the data into your history by providing the links
    placement: right
    textinsert: >-
      https://zenodo.org/record/1185122/files/Drosophila_melanogaster.BDGP6.87.dexseq.gtf
      https://zenodo.org/record/1185122/files/GSM461176_untreat_single.exon.counts
      https://zenodo.org/record/1185122/files/GSM461177_untreat_paired.exon.counts
      https://zenodo.org/record/1185122/files/GSM461178_untreat_paired.exon.counts
      https://zenodo.org/record/1185122/files/GSM461179_treat_single.exon.counts
      https://zenodo.org/record/1185122/files/GSM461180_treat_paired.exon.counts
      https://zenodo.org/record/1185122/files/GSM461181_treat_paired.exon.counts
      https://zenodo.org/record/1185122/files/GSM461182_untreat_single.exon.counts
    backdrop: false
  - title: Uploading the input data
    element: '#btn-start'
    content: Click on "Start" to start loading the data to history
    placement: right
    postclick:
      - '#btn-start'
  - title: Uploading the input data
    element: '#btn-close'
    content: >-
      The upload may take a while.<br> Hit the close button to close this
      window.
    placement: right
    postclick:
      - '#btn-close'
  - title: Differential exon usage
    element: '#tool-search-query'
    content: Search for 'DEXSeq' tool.
    placement: right
    textinsert: DEXSeq
  - title: Differential exon usage
    element: '#tool-search'
    content: Click on the 'DEXSeq' tool to open it.
    placement: right
    postclick:
      - >-
        a[href$="/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fdexseq%2Fdexseq%2F1.24.0.0"]
        .tool-old-link
  - title: Differential exon usage 1/2
    element: '#tool-search'
    content: >-
      Run the tool with the following parameters:
      <ul>
        <li>"GTF file created from DEXSeq-Count tool" to `Drosophila_melanogaster.BDGP6.87.dexseq.gtf`</li>
        <li>For "1: Factor"
          <ul>
            <li>"Specify a factor name" to `condition`</li>
            <li>"Specify a factor level" to `treated`</li>
            <li>"Count file for factor level 1" to the exon count files (multiple datasets) with `treated` in name</li>
            <li>"Specify a factor level" to `untreated`</li>
            <li>"Count file for factor level 2" to the exon count files (multiple datasets) with `untreated` in name</li>
          </ul>
        </li>
      </ul>
    position: right
  - title: Differential exon usage 2/2
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>For "2: Factor"
          <ul>
            <li>"Specify a factor name" to `sequencing`</li>
            <li>"Specify a factor level" to `pe`</li>
            <li>"Count file for factor level 1" to the exon count files (multiple datasets) with `paired` in name</li>
            <li>"Specify a factor level" to `se`</li>
            <li>"Count file for factor level 2" to the exon count files (multiple datasets) with `single` in name</li>
          </ul>
        </li>
      </ul>
    position: right
  - title: Differential exon usage
    element: '.history-right-panel .list-items > *:first'
    content: |-
      DEXSeq generates a table with:
      <ul>
        <li>Exon identifiers</li>
        <li>Gene identifiers</li>
        <li>Exon identifiers in the Gene</li>
        <li>Mean normalized counts, averaged over all samples from both conditions</li>
        <li>Logarithm (base 2) of the fold change</li>
        <li>Standard error estimate for the log2 fold change estimate</li>
        <li><i>p</i>-value for the statistical significance of this change</li>
        <li><i>p</i>-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate (<a href="https://en.wikipedia.org/wiki/False_discovery_rate">FDR</a>)</li>
      </ul>
    position: left
  - title: Counting the number of reads per exon
    element: '#tool-search-query'
    content: Search for the 'Filter' tool.
    placement: right
    textinsert: Filter
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: Click on the 'Filter' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=Filter1"] .tool-old-link'
  - title: Counting the number of reads per exon
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"Filter" to the table generated by DEXSeq</li>
        <li>"With following condition" to `c8<=0.05` (column 8 holds the adjusted <i>p</i>-value)</li>
      </ul>
    position: right
  - title: Questions
    content: |-
      <ul>
        <li>How many exons show a significant change in usage between these conditions?</li>
      </ul>
    backdrop: false
  - title: Annotation of the result tables with gene information
    content: >-
      Unfortunately, in the process of counting, we lose all the information
      about the gene except its identifier. To add this information back to
      our final count tables, we can use a tool that maps each identifier to
      its annotation.
    backdrop: true
  - title: Annotation of the result tables with gene information
    element: '#tool-search-query'
    content: Search for the 'Annotate DE(X)Seq result' tool.
    placement: right
    textinsert: Annotate DE(X)Seq result
  - title: Annotation of the result tables with gene information
    element: '#tool-search'
    content: Click on the 'Annotate DE(X)Seq result' tool to open it.
    placement: right
    postclick:
      - 'a[href$="/tool_runner?tool_id=dexseq_annotate"] .tool-old-link'
  - title: Annotation of the result tables with gene information
    element: '#tool-search'
    content: |-
      Run the tool with the following parameters:
      <ul>
        <li>"annotation file" to `Drosophila_melanogaster.BDGP5.78.gtf`</li>
      </ul>
    position: right
  - title: Conclusion
    content: >-
      In this tutorial, we have analyzed real RNA sequencing data to extract
      useful information, such as which genes are up- or downregulated by
      depletion of the Pasilla gene and which exons are regulated by the
      Pasilla gene. To answer these questions, we analyzed RNA sequencing
      datasets using a reference-based RNA-seq data analysis approach.
    backdrop: true
  - title: Key points
    content: |-
      <ul>
        <li>Use a spliced mapping tool for eukaryotic RNA-seq data</li>
        <li>Run a differential gene expression analysis that takes the studied factors into account</li>
        <li>Run a differential exon usage analysis that takes the studied factors into account</li>
      </ul>
    backdrop: true