From 189c835ebbf421edeaa64839d89dd23cd37d5534 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 28 May 2024 17:17:35 -0700
Subject: [PATCH 01/13] Updates for issue #94
- fix issue #94
(https://github.com/nasa/GeneLab_Data_Processing/issues/94) by passing
unsorted raw Bismark alignment output to deduplication step
- updated versions of all tools to latest
- updated broken documentation links (updates to TrimGalore and Bismark
included enhancements/new locations for tool documentation)
---
.../GL-DPPD-7113.md | 109 +++++++++++-------
1 file changed, 69 insertions(+), 40 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index e1e9d2dc..6892d4ad 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -63,19 +63,21 @@ Jonathan Galazka (GeneLab Project Scientist)
# Software used
|Program|Version|Relevant Links|
-|:------|:-----:|------:|
-|FastQC| 0.11.9 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
-|MultiQC| 1.12 |[https://multiqc.info/](https://multiqc.info/)|
-|Cutadapt| 3.7 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
-|TrimGalore!| 0.6.7 |[https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)|
-|Bismark| 0.23.1 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
-|bowtie2| 2.4.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
-|hisat2| 2.2.1 | [https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
-|samtools| 1.16.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
-|qualimap| 2.2.2d |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
-|methylKit| 1.20.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
-|tidyverse| 1.3.2 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
-|genomation| 1.26.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
+|:-----------|:------:|------:|
+|FastQC | 0.12.0 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
+|MultiQC | 1.22.1 |[https://multiqc.info/](https://multiqc.info/)|
+|Cutadapt | 4.8 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
+|TrimGalore! | 0.6.7 |[https://github.com/FelixKrueger/TrimGalore](https://github.com/FelixKrueger/TrimGalore)|
+|Bismark | 0.24.2 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
+|bowtie2 | 2.5.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
+|hisat2 | 2.2.1 |[https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
+|samtools | 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|qualimap | 2.3 |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
+|R | 4.4.0 |[https://www.r-project.org](https://www.r-project.org)|
+|tidyverse | 2.0.0 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
+|Bioconductor| 3.19 |[https://bioconductor.org](https://bioconductor.org)
+|methylKit | 1.30.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
+|genomation | 1.36.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
---
@@ -139,12 +141,12 @@ multiqc --interactive -o raw_multiqc_data/ -n raw_multiqc -z raw_fastqc_output/
---
## 2. Adapter trimming/quality filtering
-See `trim_galore --help` [menu](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3035) for more info on any of the below.
+See `trim_galore --help` or [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md) for more info on any of the below.
### If not RRBS or if RRBS using MseI digestion
-Note that the `--rrbs` option is **not** appropriate when RRBS (reduced representation bisulfite sequencing) libraries were prepared with MseI digestion (see `trim_galore --help` menu [(starting at this line)](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3337).
+Note that the `--rrbs` option is **not** appropriate when RRBS (reduced representation bisulfite sequencing) libraries were prepared with MseI digestion (see the TrimGalore User Guide [Note for RRBS using MseI](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material).
**Single-end example**
@@ -177,7 +179,7 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
### If RRBS with MspI digestion
-Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end; see `trim_galore --help` menu [e.g., here](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3315)).
+Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end; see [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material)).
**Single-end example**
@@ -214,7 +216,7 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
### If RRBS with NuGEN ovation kit
Libraries prepared with the NuGEN ovation kit need to be procesed with an additional script provided by the company's [github](https://github.com/nugentechnologies/NuMetRRBS#analysis-guide-for-nugen-ovation-rrbs-methyl-seq).
-Following their instructions, we first run an adapter-trimming/quality-filtering step with trimgalore. Note that the `--rrbs` option is not appropriate to pass to trimgalore when this kit is used (see `trim_galore --help` menu [(starting at this line)](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3329). Then we utilize the company's script to remove the random diversity sequences added by the kit.
+Following their instructions, we first run an adapter-trimming/quality-filtering step with trimgalore. Note that the `--rrbs` option is not appropriate to pass to trimgalore when this kit is used (see Bismark documentation for [RRBS NuGEN Ovation Methyl-Seq System](http://felixkrueger.github.io/Bismark/bismark/library_types/#rrbs-nugen-ovation-methyl-seq-system). Then we utilize the company's script to remove the random diversity sequences added by the kit.
#### First adapter-trimming/quality-filtering with trimgalore
@@ -424,7 +426,7 @@ bismark_genome_preparation --bowtie2 \
### 4b. Align
-Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end).
+Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end). For a full list of alignment option recommendations library type and/or commercially available kit, please see the library page in the [Bismark documentation](http://felixkrueger.github.io/Bismark/bismark/library_types/)
**Single-end example**
@@ -438,8 +440,8 @@ bismark --bowtie2 \
--genome_folder bismark_reference_genome/ \
sample-1_trimmed.fastq.gz
-# renaming output files so they are cleaner and will work with sorted bam file/auto-detection
-# of bismark2summary later
+# renaming output files so they are cleaner and will work with sorted bam
+# file/auto-detection of bismark2summary later
mv sample-1_trimmed_bismark_bt2_SE_report.txt sample-1_bismark_bt2_sorted_SE_report.txt
mv sample-1_trimmed_bismark_bt2.nucleotide_stats.txt sample-1_bismark_bt2.nucleotide_stats.txt
mv sample-1_trimmed_bismark_bt2.bam sample-1_bismark_bt2.bam
@@ -458,8 +460,8 @@ bismark --bowtie2 \
-1 sample-1_R1_trimmed.fastq.gz \
-2 sample-1_R2_trimmed.fastq.gz
-# renaming output files so they are cleaner and will work with sorted bam file/auto-detection
-# of bismark2summary later
+# renaming output files so they are cleaner and will work with sorted bam
+# file/auto-detection of bismark2summary later
mv sample-1_R1_trimmed_bismark_bt2_PE_report.txt sample-1_bismark_bt2_sorted_PE_report.txt
mv sample-1_R1_trimmed_bismark_bt2.nucleotide_stats.txt sample-1_bismark_bt2.nucleotide_stats.txt
mv sample-1_R1_trimmed_bismark_bt2_pe.bam sample-1_bismark_bt2_pe.bam
@@ -557,28 +559,53 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
---
-## 6. Deduplicate (skip if data are RRBS)
+## 6a. Deduplicate (skip if data are RRBS)
> **NOTE**
-> This step should **not** be done if the data are RRBS (reduced representation bisulfite sequencing; see e.g., [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs#iii-running-deduplicate_bismark)).
+> This step should **not** be done if the data are RRBS (reduced representation bisulfite sequencing; see e.g., [bismark documentation](https://felixkrueger.github.io/Bismark/bismark/deduplication/)).
```bash
-deduplicate_bismark sample-1_bismark_bt2_sorted.bam
+deduplicate_bismark sample-1_bismark_bt2.bam
```
**Parameter Definitions:**
-* sample-1_bismark_bt2_sorted.bam - the input bam file, provided as a positional argument
+* sample-1_bismark_bt2.bam - the input bam file, provided as a positional argument
**Input data:**
-* sample-1_bismark_bt2_sorted.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above)
+* sample-1_bismark_bt2.bam (unsorted bismark bowtie2 alignment bam file, output from [Step 4b](#4b-align) above)
**Output data:**
-* **\*.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicates removed)
+* \*.deduplicated.bam (unsorted bismark bowtie2 alignment bam file, with duplicates removed)
* **\*.deduplication_report.txt** (report file containing deduplication information)
+
+
+### 6b. Sort Deduplicated Alignment Files
+
+```bash
+samtools sort -@ 4 \
+ -o sample-1_bismark_bt2_sorted.deduplicated.bam \
+ sample-1_bismark_bt2.deduplicated.bam
+```
+
+**Parameter Definitions:**
+
+* `sort` - specifies to use the `sort` sub-program of `samtools`
+* `-@` - specifies the number of threads to use
+* `-o` - specifies the output file name
+* sample-1_bismark_bt2.deduplicated.bam - the input bam file, provided as a positional argument
+
+**Input data:**
+
+* sample-1_bismark_bt2.deduplicated.bam (bismark bowtie2 alignment bam file, output from [Step 6a.](#6a-deduplicate-skip-if-data-are-rrbs) above)
+
+**Output data:**
+
+* **sample-1_bismark_bt2_sorted.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicated removed)
+
---
@@ -597,8 +624,9 @@ bismark_methylation_extractor --parallel 4 \
--cytosine_report \
--genome_folder bismark_reference_genome/ \
sample-1_bismark_bt2_sorted.bam
- # note, if *not working with RRBS data, input should be the
- # deduplicated version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in step 6 above
+ # note, if *not working with RRBS data, input should be the deduplicated
+ # version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
+ # step 6 above
```
**Paired-end example**
@@ -614,37 +642,38 @@ bismark_methylation_extractor --parallel 4 \
--ignore_r2 2 \
--ignore_3prime_r2 2 \
sample-1_bismark_bt2_sorted.bam
- # note, if *not working with RRBS data, input should be the
- # deduplicated version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in step 6 above
+ # note, if *not working with RRBS data, input should be the deduplicated
+ # version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
+ # step 6 above
```
**Parameter Definitions:**
* `--parallel` - specifies the number of cores to use for methylation extraction, note: the program will utilize ~3X the number specified
-* `--bedGraph` - instructs the program to generate a sorted bedGraph file that reports the position of a given cytosine and its methlyation state (by default, only methylated CpGs are reported - see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output) for more info)
+* `--bedGraph` - instructs the program to generate a sorted bedGraph file that reports the position of a given cytosine and its methlyation state (by default, only methylated CpGs are reported - see bedgraph options in [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-specific-options) for more info)
* `--gzip` - specifies to gzip-compress the methylation extractor output files
* `--comprehensive` - specifies to merge all four possible strand-specific methylation info into context-dependent output files
* `--output_dir` - the output directory to store results
* `--cytosine_report` - instructions the program to produce a genome-wide methylation report for all cytosines in the genome
* `--genome_folder` - a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
-* `--ignore_r2` - specifies how many bases to ignore from the 5' end of the reverse reads (bismark docs recommend 2, see [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#appendix-iii-bismark-methylation-extractor))
+* `--ignore_r2` - specifies how many bases to ignore from the 5' end of the reverse reads (bismark docs recommend 2, see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#options))
> Note: The first couple of bases in the reverse read of bisulfite sequence experiments show a severe bias towards non-methylation as a result of end-reparing sonicated fragments with unmentulated cytosines, so it is recommened to remove the first couple basepairs
-* `--ignore_3prime_r2` - specifies how many bases to ignore from the 3' end of the reverse reads to remove unwanted biases from the end of reads (this is utilized in the [nf-core methylseq workflow](https://nf-co.re/methylseq), set at [this line](https://github.com/nf-core/methylseq/blob/03972a686bedeb2920803cd575f4d671e9135af0/main.nf#L643))
+* `--ignore_3prime_r2` - specifies how many bases to ignore from the 3' end of the reverse reads to remove unwanted biases from the end of reads (For specific recommnendations see Bismark documentation on [Library Types](https://felixkrueger.github.io/Bismark/bismark/library_types/))
* sample-1_bismark_bt2_sorted.bam - the input bam file, provided as a positional argument
**Input data:**
-* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6](#6-deduplicate-skip-if-data-are-rrbs) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
+* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6](#6b-sort-deduplicated-alignment-files) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
**Output data:**
-* **\*\_context\_\*.txt.gz** (bismark methylation-call files for CpG, CHG, and CHH contexts that were detected; see [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs), namely [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#methylation-call) for symbols, and [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#iv-bismark-methylation-extractor) for file format)
-* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
-* **\*.bismark.cov.gz** (gzip-compressed bedGraph-formatted file like above "\*.bedGraph.gz", but also including 2 more columns of methylated and unmethylated counts at the specified position; see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
-* **\*.M-bias.txt** (text file with methylation information in the context of the position in reads, helpful for investigating bias as a function of base position in the read; see bismark documentation [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#m-bias-plot))
+* **\*\_context\_\*.txt.gz** (bismark methylation-call files for CpG, CHG, and CHH contexts that were detected; see [bismark documentation](https://felixkrueger.github.io/Bismark/), namely [methylation call](http://felixkrueger.github.io/Bismark/bismark/alignment/#methylation-call) for symbols, and [methylation extraction output](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#the-methylation-extractor-output-looks-like-this-tab-separated) for file format)
+* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
+* **\*.bismark.cov.gz** (gzip-compressed bedGraph-formatted file like above "\*.bedGraph.gz", but also including 2 more columns of methylated and unmethylated counts at the specified position; see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-specific-options))
+* **\*.M-bias.txt** (text file with methylation information in the context of the position in reads, helpful for investigating bias as a function of base position in the read; see [bismark documentation](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#m-bias-plot))
* **\*_splitting_report.txt** (text file containing general methylation detection information)
* **\*.cytosine_context_summary.txt** (tsv file of detected cytosine-methylation information summed by nucleotide context)
* **\*.CpG_report.txt.gz** (a genome-wide methylation report for all CpG cytosines)
From 6599dc6d2869e2c421df49f9320a6ceb80dfcc95 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 29 May 2024 18:16:53 -0700
Subject: [PATCH 02/13] fixed formatting and links for deduplication step
---
.../Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 6892d4ad..41a7ecdf 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -43,6 +43,8 @@ Jonathan Galazka (GeneLab Project Scientist)
- [4c. Sort Alignment Files](#4c-sort-alignment-files)
- [5. Alignment QC](#5-alignment-qc)
- [6. Deduplicate (skip if data are RRBS)](#6-deduplicate-skip-if-data-are-rrbs)
+ - [6a. Deduplicate](#6a-deduplicate)
+ - [6b. Sort Deduplicated Alignment Files](#6b-sort-deduplicated-alignment-files)
- [7. Extract methylation calls](#7-extract-methylation-calls)
- [8. Generate individual sample report](#8-generate-individual-sample-report)
- [9. Generate combined summary report](#9-generate-combined-summary-report)
@@ -559,10 +561,14 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
---
-## 6a. Deduplicate (skip if data are RRBS)
+## 6. Deduplicate (skip if data are RRBS)
> **NOTE**
> This step should **not** be done if the data are RRBS (reduced representation bisulfite sequencing; see e.g., [bismark documentation](https://felixkrueger.github.io/Bismark/bismark/deduplication/)).
+
+
+### 6a. Deduplicate
+
```bash
deduplicate_bismark sample-1_bismark_bt2.bam
```
@@ -664,7 +670,7 @@ bismark_methylation_extractor --parallel 4 \
**Input data:**
-* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6](#6b-sort-deduplicated-alignment-files) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
+* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6b](#6b-sort-deduplicated-alignment-files) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
From 4bb28baa1b2d73752722652d4467c4a656d0afde Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 5 Jun 2024 15:52:35 -0700
Subject: [PATCH 03/13] Added additional report step
- Added an alignment multiqc step that gathers all qualimap and bismark alignment QC reports (Step 9b)
- Added trimgalore reports to trimmed multiqc report
- Added report suffixes (GLMethylSeq) to all multiqc reports
- Updated multicore specification from "4" to "NumberOfThreads"
- Updated filename specifications to more specifically call out the different filenames bismark generates for paired-end/single-end, bowtie2/hisat2
- Moved generation of nucleotide frequencies file from alignment step to genome reference generation step
- Updated R library versions
---
.../GL-DPPD-7113.md | 179 ++++++++++++------
1 file changed, 121 insertions(+), 58 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 41a7ecdf..734c070c 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -47,7 +47,9 @@ Jonathan Galazka (GeneLab Project Scientist)
- [6b. Sort Deduplicated Alignment Files](#6b-sort-deduplicated-alignment-files)
- [7. Extract methylation calls](#7-extract-methylation-calls)
- [8. Generate individual sample report](#8-generate-individual-sample-report)
- - [9. Generate combined summary report](#9-generate-combined-summary-report)
+ - [9. Generate combined summary reports](#9-generate-combined-summary-reports)
+ - [9a. Bismark summary report](#9a-bismark-summary-report)
+ - [9b. Compile Alignment and Bismark QC](#9b-compile-alignment-and-bismark-qc)
- [10. Generate reference genome annotation information](#10-generate-reference-genome-annotation-information)
- [10a. GTF to BED conversion](#10a-gtf-to-bed-conversion)
- [10b. Making a mapping file of genes to transcripts](#10b-making-a-mapping-file-of-genes-to-transcripts)
@@ -118,7 +120,7 @@ fastqc -o raw_fastqc_output/ *raw.fastq.gz
### 1b. Compile Raw Data QC
```bash
-multiqc --interactive -o raw_multiqc_data/ -n raw_multiqc -z raw_fastqc_output/
+multiqc --interactive -o raw_multiqc_GLmethylSeq_data/ -n raw_multiqc_GLmethylSeq -z raw_fastqc_output/
```
**Parameter Definitions:**
@@ -135,8 +137,8 @@ multiqc --interactive -o raw_multiqc_data/ -n raw_multiqc -z raw_fastqc_output/
**Output data:**
-* **raw_multiqc.html** (multiqc output html summary)
-* **raw_multiqc_data.zip** (zipped directory containing multiqc output data)
+* **raw_multiqc_GLMethylSeq.html** (multiqc output html summary)
+* **raw_multiqc_GLMethylSeq_data.zip** (zipped directory containing multiqc output data)
@@ -154,7 +156,7 @@ Note that the `--rrbs` option is **not** appropriate when RRBS (reduced represen
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
--output_dir trimmed_reads_out_dir/ \
sample-1_raw.fastq.gz
@@ -167,7 +169,7 @@ mv sample-1_raw_trimmed.fq.gz sample-1_trimmed.fastq.gz
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
--output_dir trimmed_reads_out_dir/ \
--paired \
@@ -187,7 +189,7 @@ Note that if the library preparation was non-directional, the `--non_directional
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
--rrbs \
--output_dir trimmed_reads_out_dir/ \
@@ -201,7 +203,7 @@ mv sample-1_raw_trimmed.fq.gz sample-1_trimmed.fastq.gz
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
--rrbs \
--output_dir trimmed_reads_out_dir/ \
@@ -226,7 +228,7 @@ Following their instructions, we first run an adapter-trimming/quality-filtering
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
-a AGATCGGAAGAGC \
--output_dir trimmed_reads_out_dir/ \
@@ -240,7 +242,7 @@ mv sample-1_raw_trimmed.fq.gz sample-1_trimmed.fastq.gz
```bash
trim_galore --gzip \
- --cores 4 \
+ --cores NumberOfThreads \
--phred33 \
--paired \
-a AGATCGGAAGAGC \
@@ -346,7 +348,7 @@ fastqc -o trimmed_fastqc_output/ *trimmed.fastq.gz
### 3b. Compile Trimmed Data QC
```bash
-multiqc --interactive -o trimmed_multiqc_data/ -n trimmed_multiqc -z trimmed_fastqc_output/
+multiqc --interactive -o trimmed_multiqc_GLmethylSeq_data/ -n trimmed_multiqc_GLmethylSeq -z trimmed_fastqc_output/ trimgalore_output/
```
**Parameter Definitions:**
@@ -356,6 +358,7 @@ multiqc --interactive -o trimmed_multiqc_data/ -n trimmed_multiqc -z trimmed_fas
* `-n` – the filename prefix for output files
* `-z` – specifies to zip the output data directory
* `trimmed_fastqc_output/` – the directory holding the output data from the fastqc run, provided as a positional argument
+* `trimgalore_output/` - the directory holding the trimgalore trimming reports.
**Input data:**
@@ -363,8 +366,8 @@ multiqc --interactive -o trimmed_multiqc_data/ -n trimmed_multiqc -z trimmed_fas
**Output data:**
-* **trimmed_multiqc.html** (multiqc output html summary)
-* **trimmed_multiqc_data** (directory containing multiqc output data)
+* **trimmed_multiqc_GLMethylSeq.html** (multiqc output html summary)
+* **trimmed_multiqc_GLMethylSeq_data.zip** (directory containing multiqc output data)
@@ -383,16 +386,22 @@ mkdir bismark_reference_genome
mv ref-genome.fasta bismark_reference_genome/
bismark_genome_preparation --bowtie2 \
- --parallel 4 \
+ --parallel NumberOfThreads \
bismark_reference_genome/
+
+bam2nuc --genome_folder bismark_reference_genome/ --genomic_composition_only
```
**Parameter Definitions:**
+*bismark_genome_preparation*
* `--bowtie2` - specify bismark to create bisulfite indexes for use with Bowtie2
* `--parallel` – specifies how many threads to use (note these will be doubled as it operates on both strands simultaneously)
* positional argument specifing the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
+*bam2nuc*
+* --genome_folder - species the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
+* --genomic_composition_only - species creation of the genomic_nucleotide_frequencies.txt report, which is genome rathe than sample specific.
**Input data:**
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
@@ -418,6 +427,7 @@ bismark_genome_preparation --bowtie2 \
* BS_GA.rev.2.bt2
* genome_mfa.GA_conversion.fa
* \*.txt (captured standard output from the command)
+* bismark_reference_genome/genomic_nucleotide_frequencies.txt
@@ -435,7 +445,7 @@ Note that if the library preparation was non-directional, the `--non_directional
```bash
bismark --bowtie2 \
--bam \
- --parallel 4 \
+ --parallel NumberOfThreads \
--non_bs_mm \
--nucleotide_coverage \
--output_dir mapping_files_out_dir/ \
@@ -454,7 +464,7 @@ mv sample-1_trimmed_bismark_bt2.bam sample-1_bismark_bt2.bam
```bash
bismark --bowtie2 \
--bam \
- --parallel 4 \
+ --parallel NumberOfThreads \
--non_bs_mm \
--nucleotide_coverage \
--output_dir mapping_files_out_dir/ \
@@ -489,23 +499,22 @@ mv sample-1_R1_trimmed_bismark_bt2_pe.bam sample-1_bismark_bt2_pe.bam
**Output data:**
-* sample-1_bismark_bt2.bam (mapping file)
-* **\*report.txt** (bismark mapping report file)
-* **genomic_nucleotide_frequencies.txt** (tab-delimited table of mono- and di-nucleotide frequencies in reference genome)
+* sample-1_bismark_{bt2,bt2_pe,hisat2,hisat2_pe}.bam (mapping file)
+* **\*_[SP]E_report.txt** (bismark mapping report file)
* **\*.nucleotide_stats.txt** (tab-delimited table with sample-specific mono- and di-nucleotide sequence compositions and coverage values compared to genomic compositions
> **NOTE**
-> If using RNA, need to add the `--hisat2` and `--path_to_hisat2` flags.
+> If using RNA, need to add the `--hisat2` and `--path_to_hisat2` flags. Output files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
### 4c. Sort Alignment Files
```bash
-samtools sort -@ 4 \
+samtools sort -@ NumberOfThreads \
-o sample-1_bismark_bt2_sorted.bam \
- sample-1_bismark_bt2.bam
+ sample-1_bismark_{bt2,bt2_pe}.bam
```
**Parameter Definitions:**
@@ -513,11 +522,13 @@ samtools sort -@ 4 \
* `sort` - specifies to use the `sort` sub-program of `samtools`
* `-@` - specifies the number of threads to use
* `-o` - specifies the output file name
-* sample-1_bismark_bt2.bam - the input bam file, provided as a positional argument
+* sample-1_bismark_{bt2,bt2_pe}.bam - the input bam file, provided as a positional argument
**Input data:**
-* sample-1_bismark_bt2.bam (bismark bowtie2 alignment bam file, output from [Step 4b.](#4b-align) above)
+* sample-1_bismark_bt2.bam (bismark alignment bam file, output from [Step 4b.](#4b-align) above)
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
@@ -535,7 +546,7 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
-outdir sample-1_bismark_bt2_qualimap/ \
--collect-overlap-pairs \
--java-mem-size=6G \
- -nt 4
+ -nt NumberOfThreads
```
**Parameter Definitions:**
@@ -550,12 +561,14 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
**Input data:**
-* sample-1_bismark_bt2_sorted.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above)
+* sample-1_bismark_{bt2,hisat2}_sorted.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above)
* a feature file contining regions of interest for the reference genome in gtf format (this pipeline version uses the Ensembl fasta file indicated in the `gtf` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
-* **\*sample-1_bismark_bt2_qualimap/** (subdirectory of many alignment QC output files and formatting files for presenting in an html file (see [qualimap documentation](http://qualimap.conesalab.org/doc_html/analysis.html#output))
+* **\*sample-1_bismark_{bt2,hisat2}_qualimap/** (subdirectory of many alignment QC output files and formatting files for presenting in an html file (see [qualimap documentation](http://qualimap.conesalab.org/doc_html/analysis.html#output))
@@ -570,16 +583,18 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
### 6a. Deduplicate
```bash
-deduplicate_bismark sample-1_bismark_bt2.bam
+deduplicate_bismark sample-1_bismark_{bt2,bt2_pe}.bam
```
**Parameter Definitions:**
-* sample-1_bismark_bt2.bam - the input bam file, provided as a positional argument
+* sample-1_bismark_{bt2,bt2_pe}.bam - the input bam file, provided as a positional argument
**Input data:**
-* sample-1_bismark_bt2.bam (unsorted bismark bowtie2 alignment bam file, output from [Step 4b](#4b-align) above)
+* sample-1_bismark_{bt2,bt2_pe}.bam (unsorted bismark bowtie2 alignment bam file, output from [Step 4b](#4b-align) above)
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
@@ -592,9 +607,9 @@ deduplicate_bismark sample-1_bismark_bt2.bam
### 6b. Sort Deduplicated Alignment Files
```bash
-samtools sort -@ 4 \
+samtools sort -@ NumberOfThreads \
-o sample-1_bismark_bt2_sorted.deduplicated.bam \
- sample-1_bismark_bt2.deduplicated.bam
+ sample-1_bismark_{bt2,bt2_pe}.deduplicated.bam
```
**Parameter Definitions:**
@@ -606,7 +621,9 @@ samtools sort -@ 4 \
**Input data:**
-* sample-1_bismark_bt2.deduplicated.bam (bismark bowtie2 alignment bam file, output from [Step 6a.](#6a-deduplicate-skip-if-data-are-rrbs) above)
+* sample-1_bismark_{bt2,bt2_pe}.deduplicated.bam (bismark bowtie2 alignment bam file, output from [Step 6a.](#6a-deduplicate-skip-if-data-are-rrbs) above)
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
@@ -622,23 +639,23 @@ samtools sort -@ 4 \
**Single-end example**
```bash
-bismark_methylation_extractor --parallel 4 \
+bismark_methylation_extractor --parallel NumberOfThreads \
--bedGraph \
--gzip \
--comprehensive \
--output_dir methylation_calls_out_dir/ \
--cytosine_report \
--genome_folder bismark_reference_genome/ \
- sample-1_bismark_bt2_sorted.bam
+ sample-1_bismark_bt2.bam
# note, if *not working with RRBS data, input should be the deduplicated
- # version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
+ # version (sample-1_bismark_bt2*.deduplicated.bam) produced in
# step 6 above
```
**Paired-end example**
```bash
-bismark_methylation_extractor --parallel 4 \
+bismark_methylation_extractor --parallel NumberOfThreads \
--bedGraph \
--gzip \
--comprehensive \
@@ -647,9 +664,9 @@ bismark_methylation_extractor --parallel 4 \
--genome_folder bismark_reference_genome/ \
--ignore_r2 2 \
--ignore_3prime_r2 2 \
- sample-1_bismark_bt2_sorted.bam
+ sample-1_bismark_bt2_pe.bam
# note, if *not working with RRBS data, input should be the deduplicated
- # version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
+ # version (sample-1_bismark_bt2*.deduplicated.bam) produced in
# step 6 above
```
@@ -670,8 +687,10 @@ bismark_methylation_extractor --parallel 4 \
**Input data:**
-* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6b](#6b-sort-deduplicated-alignment-files) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
+* sample-1_bismark_bt2*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4b](#4b-align) above if data are RRBS, or deduplicated bam file from [step 6a](#6a-deduplicate) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2.deduplicated.bam from above))
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
@@ -693,9 +712,9 @@ bismark_methylation_extractor --parallel 4 \
```bash
bismark2report --dir sample-1_bismark_report_out_dir/ \
- --alignment_report sample-1_bismark_bt2_sorted_SE_report.txt \
- --splitting_report sample-1_bismark_bt2_sorted_splitting_report.txt \
- --mbias_report sample-1_bismark_bt2_sorted.M-bias.txt
+ --alignment_report sample-1_bismark_bt2_[SP]E_report.txt \
+ --splitting_report sample-1_bismark_{bt2,bt2_pe}_splitting_report.txt \
+ --mbias_report sample-1_bismark_{bt2,bt2_pe}.M-bias.txt
```
**Parameter Definitions:**
@@ -707,9 +726,11 @@ bismark2report --dir sample-1_bismark_report_out_dir/ \
**Input data:**
-* sample-1_bismark_bt2_sorted_SE_report.txt (bismark mapping report file, output from [Step 4b.](#4b-align))
-* sample-1_bismark_bt2_sorted_splitting_report.txt (splitting report file, output from [Step 7](#7-extract-methylation-calls) above)
-* sample-1_bismark_bt2_sorted.M-bias.txt (text file with methylation information in the context of the position in reads, output from [Step 7](#7-extract-methylation-calls) above)
+* sample-1_bismark_bt2_[SP]E_report.txt (bismark mapping report file, output from [Step 4b.](#4b-align))
+* sample-1_bismark_{bt2,bt2_pe}_splitting_report.txt (splitting report file, output from [Step 7](#7-extract-methylation-calls) above)
+* sample-1_bismark_{bt2,bt2_pe}.M-bias.txt (text file with methylation information in the context of the position in reads, output from [Step 7](#7-extract-methylation-calls) above)
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
> **NOTE**
> If data are **not** RRBS, the deduplication report from [step 6](#6-deduplicate-skip-if-data-are-rrbs) above should also be provided to the above command to the `--dedup_report` parameter
@@ -722,20 +743,24 @@ bismark2report --dir sample-1_bismark_report_out_dir/ \
---
-## 9. Generate combined summary report
+## 9. Generate combined summary reports
+
+### 9a. Bismark Summary Report
```bash
-bismark2summary sample-1_bismark_bt2_sorted.bam
+bismark2summary sample-1_bismark_{bt2,bt2_pe}.bam
```
**Input data:**
-* autodetects appropriate files based on the sorted bam files
+* autodetects appropriate files based on the input bam files
* the positional argument(s) can either be explicitly naming the bam file(s) as done above or can be the top-level directory holding the initial bam files and relevant report files
* the autodetected files cannot be explicitly provided, but it looks for those named like these listed here and includes them if they exist for each individual starting bam file it is given or finds
- * sample-1_bismark_bt2_sorted_SE_report.txt generated from [Step 4b.](#4b-align) above
- * sample-1_bismark_bt2_sorted_splitting_report.txt from [Step 7](#7-extract-methylation-calls) above
- * and the deduplication report files if deduplication was performed in [Step 6](#6-deduplicate-skip-if-data-are-rrbs)
+ * sample-1_bismark_bt2_[SP]E_report.txt generated from [Step 4b.](#4b-align) above
+ * sample-1_bismark_{bt2,bt2_pe}_splitting_report.txt from [Step 7](#7-extract-methylation-calls) above
+ * sample-1_bismark_{bt2,bt2_pe}.deduplication_report.txt if deduplication was performed in [Step 6](#6-deduplicate-skip-if-data-are-rrbs)
+> **NOTE**
+> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
**Output data:**
@@ -744,6 +769,43 @@ bismark2summary sample-1_bismark_bt2_sorted.bam
+---
+
+### 9b. Compile Alignment and Bismark QC
+
+```bash
+multiqc --interactive -o align_multiqc_data/ -n align_multiqc -z \
+ qualimap_out_dir/ mapping_files_out_dir/ methylation_calls_out_dir/ deduplication_out_dir/
+```
+
+**Parameter Definitions:**
+
+* `--interactive` - force reports to use interactive plots
+* `-o` – the output directory to store results
+* `-n` – the filename prefix for output files
+* `-z` – specifies to zip the output data directory
+* `qualimap_out_dir/` – the directory holding the output data from the qualimap run, provided as a positional argument
+* `methylation_calls_out_dir/` – the directory holding the output data from the methylation extraction run, provided as a positional argument
+* `mapping_files_out_dir/` – the directory holding the output data from the fastqc run, provided as a positional argument
+* `deduplication_out_dir/` – the directory holding the output data from the deduplication run, provided as a positional argument (omitted if RRBS data)
+
+**Input data:**
+
+* qualimap_out_dir/*.txt (qualimap output results from [Step 5](#5-alignment-qc))
+* mapping_files_out_dir/*_[SP]E_report.txt (Bismark alignment results from [Step 4b](#4b-align))
+* mapping_files_out_dir/*.nucleotide_stats.txt (Bismark nucleotide stats results from [Step 4b](#4b-align))
+* deduplication_out_dir/*.deduplication_report.txt (Bismark deduplication results from [Step 6a](#6a-deduplicate) if not RRBS)
+* methylation_calls_out_dir/*_splitting_report.txt (methylation extraction results from [Step 7](#7-extract-methylation-calls))
+* methylation_calls_out_dir/*M-bias.txt (methylation bias results from [Step 7](#7-extract-methylation-calls))
+
+**Output data:**
+
+* **align_multiqc_GLmethylSeq.html** (multiqc output html summary)
+* **align_multiqc_GLmethylSeq_data** (directory containing multiqc output data)
+
+
+
+
---
## 10. Generate reference genome annotation information
@@ -784,7 +846,7 @@ awk ' $3 == "transcript" ' reference.gtf | cut -f 9 | tr -s ";" "\t" | \
**Output data:**
-* **reference-gene-to-transcript-map.tsv** (a gene-to-transcript mapping file with gene IDs in the first column and transcript IDs in the second)
+* **reference-gene-to-transcript-map-GLmethylSeq.tsv** (a gene-to-transcript mapping file with gene IDs in the first column and transcript IDs in the second)
@@ -811,12 +873,13 @@ The remainder of this document is performed in R.
### Install R packages if not already installed ###
install.packages("remotes")
-remotes::install_version("tidyverse", version = "1.3.2")
-
-if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
+remotes::install_version("tidyverse", version = "2.0.0")
-BiocManager::install("methylKit", version = "3.14")
-BiocManager::install("genomation", version = "3.14")
+if (!require("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
+BiocManager::install(version = "3.19")
+BiocManager::install("methylKit", version = "3.19")
+BiocManager::install("genomation", version = "3.19")
### Import libraries (###
From fa107a074398e876fc3045aab89e68565f4a48c4 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Wed, 10 Jul 2024 13:54:54 -0700
Subject: [PATCH 04/13] Typo and formatting changes
---
.../GL-DPPD-7113.md | 49 ++++++++++++-------
1 file changed, 30 insertions(+), 19 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 734c070c..1bcff34c 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -120,7 +120,11 @@ fastqc -o raw_fastqc_output/ *raw.fastq.gz
### 1b. Compile Raw Data QC
```bash
-multiqc --interactive -o raw_multiqc_GLmethylSeq_data/ -n raw_multiqc_GLmethylSeq -z raw_fastqc_output/
+multiqc --interactive \
+ -o raw_multiqc_GLmethylSeq_data/ \
+ -n raw_multiqc_GLmethylSeq \
+ -z \
+ raw_fastqc_output/
```
**Parameter Definitions:**
@@ -348,7 +352,11 @@ fastqc -o trimmed_fastqc_output/ *trimmed.fastq.gz
### 3b. Compile Trimmed Data QC
```bash
-multiqc --interactive -o trimmed_multiqc_GLmethylSeq_data/ -n trimmed_multiqc_GLmethylSeq -z trimmed_fastqc_output/ trimgalore_output/
+multiqc --interactive \
+ -o trimmed_multiqc_GLmethylSeq_data/ \
+ -n trimmed_multiqc_GLmethylSeq \
+ -z \
+ trimmed_fastqc_output/ trimgalore_output/
```
**Parameter Definitions:**
@@ -358,7 +366,7 @@ multiqc --interactive -o trimmed_multiqc_GLmethylSeq_data/ -n trimmed_multiqc_GL
* `-n` – the filename prefix for output files
* `-z` – specifies to zip the output data directory
* `trimmed_fastqc_output/` – the directory holding the output data from the fastqc run, provided as a positional argument
-* `trimgalore_output/` - the directory holding the trimgalore trimming reports.
+* `trimgalore_output/` - the directory holding the trimgalore trimming reports, provided as a positional argument
**Input data:**
@@ -389,7 +397,8 @@ bismark_genome_preparation --bowtie2 \
--parallel NumberOfThreads \
bismark_reference_genome/
-bam2nuc --genome_folder bismark_reference_genome/ --genomic_composition_only
+bam2nuc --genomic_composition_only \
+ --genome_folder bismark_reference_genome/
```
**Parameter Definitions:**
@@ -400,8 +409,10 @@ bam2nuc --genome_folder bismark_reference_genome/ --genomic_composition_only
* positional argument specifing the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
*bam2nuc*
-* --genome_folder - species the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
-* --genomic_composition_only - species creation of the genomic_nucleotide_frequencies.txt report, which is genome rathe than sample specific.
+* --genomic_composition_only - specifies creation of the (genome-specific) genomic_nucleotide_frequencies.txt report
+* --genome_folder - specifies the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
+
+
**Input data:**
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
@@ -427,7 +438,7 @@ bam2nuc --genome_folder bismark_reference_genome/ --genomic_composition_only
* BS_GA.rev.2.bt2
* genome_mfa.GA_conversion.fa
* \*.txt (captured standard output from the command)
-* bismark_reference_genome/genomic_nucleotide_frequencies.txt
+* bismark_reference_genome/genomic_nucleotide_frequencies.txt (tab-delimited table of mono- and di-nucleotide frequencies in reference genome)
@@ -501,7 +512,7 @@ mv sample-1_R1_trimmed_bismark_bt2_pe.bam sample-1_bismark_bt2_pe.bam
* sample-1_bismark_{bt2,bt2_pe,hisat2,hisat2_pe}.bam (mapping file)
* **\*_[SP]E_report.txt** (bismark mapping report file)
-* **\*.nucleotide_stats.txt** (tab-delimited table with sample-specific mono- and di-nucleotide sequence compositions and coverage values compared to genomic compositions
+* **\*.nucleotide_stats.txt** (tab-delimited table with sample-specific mono- and di-nucleotide sequence compositions and coverage values compared to genomic compositions)
> **NOTE**
@@ -598,7 +609,7 @@ deduplicate_bismark sample-1_bismark_{bt2,bt2_pe}.bam
**Output data:**
-* \*.deduplicated.bam (unsorted bismark bowtie2 alignment bam file, with duplicates removed)
+* **\*.deduplicated.bam** (unsorted bismark bowtie2 alignment bam file, with duplicates removed)
* **\*.deduplication_report.txt** (report file containing deduplication information)
@@ -608,7 +619,7 @@ deduplicate_bismark sample-1_bismark_{bt2,bt2_pe}.bam
```bash
samtools sort -@ NumberOfThreads \
- -o sample-1_bismark_bt2_sorted.deduplicated.bam \
+ -o sample-1_bismark_{bt2,bt2_pe}_sorted.deduplicated.bam \
sample-1_bismark_{bt2,bt2_pe}.deduplicated.bam
```
@@ -627,7 +638,7 @@ samtools sort -@ NumberOfThreads \
**Output data:**
-* **sample-1_bismark_bt2_sorted.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicated removed)
+* **sample-1_bismark_{bt2,bt2_pe}_sorted.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicates removed)
@@ -649,7 +660,7 @@ bismark_methylation_extractor --parallel NumberOfThreads \
sample-1_bismark_bt2.bam
# note, if *not working with RRBS data, input should be the deduplicated
# version (sample-1_bismark_bt2*.deduplicated.bam) produced in
- # step 6 above
+ # step 6a above
```
**Paired-end example**
@@ -667,7 +678,7 @@ bismark_methylation_extractor --parallel NumberOfThreads \
sample-1_bismark_bt2_pe.bam
# note, if *not working with RRBS data, input should be the deduplicated
# version (sample-1_bismark_bt2*.deduplicated.bam) produced in
- # step 6 above
+ # [Step 6a.](#6a-deduplicate) above
```
@@ -733,7 +744,7 @@ bismark2report --dir sample-1_bismark_report_out_dir/ \
> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
> **NOTE**
-> If data are **not** RRBS, the deduplication report from [step 6](#6-deduplicate-skip-if-data-are-rrbs) above should also be provided to the above command to the `--dedup_report` parameter
+> If data are **not** RRBS, the deduplication report from [Step 6a.](#6a-deduplicate) above should also be provided to the above command to the `--dedup_report` parameter
**Output data:**
@@ -758,7 +769,7 @@ bismark2summary sample-1_bismark_{bt2,bt2_pe}.bam
* the autodetected files cannot be explicitly provided, but it looks for those named like these listed here and includes them if they exist for each individual starting bam file it is given or finds
* sample-1_bismark_bt2_[SP]E_report.txt generated from [Step 4b.](#4b-align) above
* sample-1_bismark_{bt2,bt2_pe}_splitting_report.txt from [Step 7](#7-extract-methylation-calls) above
- * sample-1_bismark_{bt2,bt2_pe}.deduplication_report.txt if deduplication was performed in [Step 6](#6-deduplicate-skip-if-data-are-rrbs)
+ * sample-1_bismark_{bt2,bt2_pe}.deduplication_report.txt if deduplication was performed in [Step 6a.](#6a-deduplicate)
> **NOTE**
> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
@@ -774,7 +785,7 @@ bismark2summary sample-1_bismark_{bt2,bt2_pe}.bam
### 9b. Compile Alignment and Bismark QC
```bash
-multiqc --interactive -o align_multiqc_data/ -n align_multiqc -z \
+multiqc --interactive -o align_and_bismark_multiqc_data/ -n align_and_bismark_multiqc -z \
qualimap_out_dir/ mapping_files_out_dir/ methylation_calls_out_dir/ deduplication_out_dir/
```
@@ -786,7 +797,7 @@ multiqc --interactive -o align_multiqc_data/ -n align_multiqc -z \
* `-z` – specifies to zip the output data directory
* `qualimap_out_dir/` – the directory holding the output data from the qualimap run, provided as a positional argument
* `methylation_calls_out_dir/` – the directory holding the output data from the methylation extraction run, provided as a positional argument
-* `mapping_files_out_dir/` – the directory holding the output data from the fastqc run, provided as a positional argument
+* `mapping_files_out_dir/` – the directory holding the output data from the alignment run, provided as a positional argument
* `deduplication_out_dir/` – the directory holding the output data from the deduplication run, provided as a positional argument (omitted if RRBS data)
**Input data:**
@@ -800,8 +811,8 @@ multiqc --interactive -o align_multiqc_data/ -n align_multiqc -z \
**Output data:**
-* **align_multiqc_GLmethylSeq.html** (multiqc output html summary)
-* **align_multiqc_GLmethylSeq_data** (directory containing multiqc output data)
+* **align_and_bismark_multiqc_GLmethylSeq.html** (multiqc output html summary)
+* **align_and_bismark_multiqc_GLmethylSeq_data** (directory containing multiqc output data)
From 2eb2b69501b618e85520af0356755d7880516e5a Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Fri, 9 Aug 2024 16:27:17 -0700
Subject: [PATCH 05/13] - clipping and performance updates
- added clipping parameters for random-priming library type
- added gzip parameter to bismark alignment command to reduce
intermediate file size
---
.../GL-DPPD-7113.md | 33 +++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 1bcff34c..b179bf03 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -151,6 +151,8 @@ multiqc --interactive \
## 2. Adapter trimming/quality filtering
See `trim_galore --help` or [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md) for more info on any of the below.
+Additionally, the Bismark documentation also includes guidelines for specific MethylSeq library types: [Bismark library type guide](http://felixkrueger.github.io/Bismark/bismark/library_types/). Some library types will require additional 5' and/or 3' hard trimming to remove the signature of the oligos used for random priming. Leaving these bases may cause misalignments and methylation biases.
+
### If not RRBS or if RRBS using MseI digestion
@@ -186,6 +188,30 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
+### If using a random priming post-bisulfite method
+(such as TruSeq (formerly EpiGnome), PBAT, scBSSeq, Pico Methyl, Accel, etc.)
+Random priming is not truly random and the signature left at the ends of the reads can introduce errors, indels, and methylation biases. Add the optional clipping parameters (`--clip_r1`, `--clip_r2`, `--three_prime_clip_r1`, and `--three_prime_clip_r2`) to trim off the random priming signature on the 5' ends of each read and next to the 3' end after adapter trimming. See [Bismark library type guide](http://felixkrueger.github.io/Bismark/bismark/library_types/) for more detailed information.
+
+**Paired-end example for TruSeq (EpiGnome) library prep**
+```bash
+trim_galore --gzip \
+ --cores NumberOfThreads \
+ --phred33 \
+ --output_dir trimmed_reads_out_dir/ \
+ --paired \
+ --clip_R1 8 \
+ --clip_R2 8 \
+ --three_prime_clip_R1 8 \
+ --three_prime_clip_R2 8 \
+ sample-1_R1_raw.fastq.gz sample-1_R2_raw.fastq.gz
+
+# renaming outputs to use GeneLab standard suffix
+mv sample-1_R1_raw_val_1.fq.gz sample-1_R1_trimmed.fastq.gz
+mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
+```
+
+
+
### If RRBS with MspI digestion
Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end; see [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material)).
@@ -302,6 +328,10 @@ mv sample-1_R2_trimmed.fastq_trimmed.fq.gz sample-1_R2_trimmed.fastq.gz
* `-a2` - specific adapter sequence to be trimmed off of reverse reads (applicable for libraries prepared with the NuGEN ovation kit)
* `--paired` - specifies data are paired-end
* `--output_dir` - the output directory to store results
+* `--clip_R1` - number of bases to trim off the 5' end of each R1 read (optional, for use with library prep kits that use random priming, such as TruSeq(EpiGnome))
+* `--clip_R2` - number of bases to trim off the 5' end of each R2 read (optional, for use with library prep kits that use random priming, such as TruSeq(EpiGnome))
+* `--three_prime_clip_R1` - number of bases to trim off the 3' end of each R1 read AFTER adapter trimming. (optional, for use with library prep kits that use random priming, such as TruSeq(EpiGnome))
+* `--three_prime_clip_R2` - number of bases to trim off the 3' end of each R2 read AFTER adapter trimming. (optional, for use with library prep kits that use random priming, such as TruSeq(EpiGnome))
* positional arguments represent the input read files, 2 of them if paired-end data
@@ -459,6 +489,7 @@ bismark --bowtie2 \
--parallel NumberOfThreads \
--non_bs_mm \
--nucleotide_coverage \
+ --gzip \
--output_dir mapping_files_out_dir/ \
--genome_folder bismark_reference_genome/ \
sample-1_trimmed.fastq.gz
@@ -478,6 +509,7 @@ bismark --bowtie2 \
--parallel NumberOfThreads \
--non_bs_mm \
--nucleotide_coverage \
+ --gzip \
--output_dir mapping_files_out_dir/ \
--genome_folder bismark_reference_genome/ \
-1 sample-1_R1_trimmed.fastq.gz \
@@ -497,6 +529,7 @@ mv sample-1_R1_trimmed_bismark_bt2_pe.bam sample-1_bismark_bt2_pe.bam
* `--parallel` - allows us to specify the number of threads to use (note: will consume 3-5X this value)
* `--non_bs_mm` - outputs an extra column in the bam file specifying the number of non-bisulfite mismatches each read has
* `--nucleotide_coverage` - outputs a table with mono- and di-nucleotide sequence compositions and coverage values compared to genomic compositions
+* `--gzip` - write temporary bisulfite conversion files in gzip format to save disk space during alignment
* `--output_dir` - the output directory to store results
* `--genome_folder` - specifies the directory holding the reference genome indexes (the same that was provided in [Step 4a.](#4a-generate-reference) above)
* input trimmed-reads are provided as a positional argument if they are single-end data
From 7f29c8c32c6b6ed2fc8b491f5c841ba10a1fc075 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 28 Aug 2024 09:45:15 -0700
Subject: [PATCH 06/13] Update GL-DPPD-7113.md
Updated TOC with link to newly added trimming option.
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index b179bf03..0c50a16b 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -30,6 +30,7 @@ Jonathan Galazka (GeneLab Project Scientist)
- [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
- [2. Adapter trimming/quality filtering](#2-adapter-trimmingquality-filtering)
- [If not RRBS or if RRBS using MseI digestion](#if-not-rrbs-or-if-rrbs-using-msei-digestion)
+ - [If using a random priming bi-sulfite method](#if-using-a-random-priming-post-bisulfite-method)
- [If RRBS with MspI digestion](#if-rrbs-with-mspi-digestion)
- [If RRBS with NuGEN ovation kit](#if-rrbs-with-nugen-ovation-kit)
- [First adapter-trimming/quality-filtering with trimgalore](#first-adapter-trimmingquality-filtering-with-trimgalore)
From 45295e571ce6f8777a14932edd915a50ce84e931 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Wed, 28 Aug 2024 09:49:24 -0700
Subject: [PATCH 07/13] typo fix
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 0c50a16b..ccf3f43d 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -30,7 +30,7 @@ Jonathan Galazka (GeneLab Project Scientist)
- [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
- [2. Adapter trimming/quality filtering](#2-adapter-trimmingquality-filtering)
- [If not RRBS or if RRBS using MseI digestion](#if-not-rrbs-or-if-rrbs-using-msei-digestion)
- - [If using a random priming bi-sulfite method](#if-using-a-random-priming-post-bisulfite-method)
+ - [If using a random priming bisulfite method](#if-using-a-random-priming-post-bisulfite-method)
- [If RRBS with MspI digestion](#if-rrbs-with-mspi-digestion)
- [If RRBS with NuGEN ovation kit](#if-rrbs-with-nugen-ovation-kit)
- [First adapter-trimming/quality-filtering with trimgalore](#first-adapter-trimmingquality-filtering-with-trimgalore)
From 48a18cf6859618af68fca3754c5c52e00f4dab0c Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Thu, 5 Sep 2024 11:04:09 -0700
Subject: [PATCH 08/13] Updating ToC
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 2 --
1 file changed, 2 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index ccf3f43d..e4199cf2 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -21,8 +21,6 @@ Jonathan Galazka (GeneLab Project Scientist)
# Table of contents
-- [Bioinformatics pipeline for Methylation Sequencing (Methyl-Seq) data](#bioinformatics-pipeline-for-methylation-sequencing-methyl-seq-data)
-- [Table of contents](#table-of-contents)
- [Software used](#software-used)
- [General processing overview with example commands](#general-processing-overview-with-example-commands)
- [1. Raw Data QC](#1-raw-data-qc)
From da9e89df2b0cd091aca30e039e54f3b1205bc8c3 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 10 Sep 2024 14:33:01 -0700
Subject: [PATCH 09/13] Update GL-DPPD-7113.md
Added missing UCSC tools to software table.
---
.../GL-DPPD-7113.md | 32 ++++++++++---------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index e4199cf2..780f087b 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -66,21 +66,23 @@ Jonathan Galazka (GeneLab Project Scientist)
# Software used
|Program|Version|Relevant Links|
-|:-----------|:------:|------:|
-|FastQC | 0.12.0 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
-|MultiQC | 1.22.1 |[https://multiqc.info/](https://multiqc.info/)|
-|Cutadapt | 4.8 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
-|TrimGalore! | 0.6.7 |[https://github.com/FelixKrueger/TrimGalore](https://github.com/FelixKrueger/TrimGalore)|
-|Bismark | 0.24.2 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
-|bowtie2 | 2.5.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
-|hisat2 | 2.2.1 |[https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
-|samtools | 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
-|qualimap | 2.3 |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
-|R | 4.4.0 |[https://www.r-project.org](https://www.r-project.org)|
-|tidyverse | 2.0.0 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
-|Bioconductor| 3.19 |[https://bioconductor.org](https://bioconductor.org)
-|methylKit | 1.30.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
-|genomation | 1.36.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
+|:------------|:------:|------:|
+|FastQC | 0.12.0 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
+|MultiQC | 1.22.1 |[https://multiqc.info/](https://multiqc.info/)|
+|Cutadapt | 4.8 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
+|TrimGalore! | 0.6.10 |[https://github.com/FelixKrueger/TrimGalore](https://github.com/FelixKrueger/TrimGalore)|
+|Bismark | 0.24.2 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
+|bowtie2 | 2.5.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
+|hisat2 | 2.2.1 |[https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
+|samtools | 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|qualimap | 2.3 |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
+|gtfToGenePred| 447 |[http://hgdownload.cse.ucsc.edu/admin/exe/](http://hgdownload.cse.ucsc.edu/admin/exe/)|
+|genePredToBed| 447 |[http://hgdownload.cse.ucsc.edu/admin/exe/](http://hgdownload.cse.ucsc.edu/admin/exe/)
+|R | 4.4.0 |[https://www.r-project.org](https://www.r-project.org)|
+|tidyverse | 2.0.0 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
+|Bioconductor | 3.19 |[https://bioconductor.org](https://bioconductor.org)
+|methylKit | 1.30.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
+|genomation | 1.36.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
---
From 3f88317fa597864cfe9f594046474bbc3835cc51 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 10 Sep 2024 15:35:41 -0700
Subject: [PATCH 10/13] Update GL-DPPD-7113.md
update FastQC version
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 780f087b..9231d26e 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -67,7 +67,7 @@ Jonathan Galazka (GeneLab Project Scientist)
|Program|Version|Relevant Links|
|:------------|:------:|------:|
-|FastQC | 0.12.0 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
+|FastQC | 0.12.1 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
|MultiQC | 1.22.1 |[https://multiqc.info/](https://multiqc.info/)|
|Cutadapt | 4.8 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
|TrimGalore! | 0.6.10 |[https://github.com/FelixKrueger/TrimGalore](https://github.com/FelixKrueger/TrimGalore)|
From 893ef6b111f4fe58ec97621ebe160ce37767f443 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 28 Oct 2024 16:37:02 -0700
Subject: [PATCH 11/13] Update GL-DPPD-7113.md
Specified genomic_nucleotide_frequencies.txt as an output file.
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index 9231d26e..e853b423 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -469,7 +469,7 @@ bam2nuc --genomic_composition_only \
* BS_GA.rev.2.bt2
* genome_mfa.GA_conversion.fa
* \*.txt (captured standard output from the command)
-* bismark_reference_genome/genomic_nucleotide_frequencies.txt (tab-delimited table of mono- and di-nucleotide frequencies in reference genome)
+* **bismark_reference_genome/genomic_nucleotide_frequencies.txt** (tab-delimited table of mono- and di-nucleotide frequencies in reference genome)
From 82b9d0b69677b81a05c22034649508c05e404991 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 6 Nov 2024 19:20:30 -0800
Subject: [PATCH 12/13] Update GL-DPPD-7113.md
Fix typos
---
.../Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index e853b423..a86a2593 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -249,7 +249,7 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
### If RRBS with NuGEN ovation kit
-Libraries prepared with the NuGEN ovation kit need to be procesed with an additional script provided by the company's [github](https://github.com/nugentechnologies/NuMetRRBS#analysis-guide-for-nugen-ovation-rrbs-methyl-seq).
+Libraries prepared with the NuGEN ovation kit need to be processed with an additional script provided by the company's [github](https://github.com/nugentechnologies/NuMetRRBS#analysis-guide-for-nugen-ovation-rrbs-methyl-seq).
Following their instructions, we first run an adapter-trimming/quality-filtering step with trimgalore. Note that the `--rrbs` option is not appropriate to pass to trimgalore when this kit is used (see Bismark documentation for [RRBS NuGEN Ovation Methyl-Seq System](http://felixkrueger.github.io/Bismark/bismark/library_types/#rrbs-nugen-ovation-methyl-seq-system). Then we utilize the company's script to remove the random diversity sequences added by the kit.
@@ -437,7 +437,7 @@ bam2nuc --genomic_composition_only \
*bismark_genome_preparation*
* `--bowtie2` - specify bismark to create bisulfite indexes for use with Bowtie2
* `--parallel` – specifies how many threads to use (note these will be doubled as it operates on both strands simultaneously)
-* positional argument specifing the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
+* positional argument specifying the directory holding the reference genome (should end in ".fa" or ".fasta", can be gzipped and including ".gz")
*bam2nuc*
* --genomic_composition_only - specifies creation of the (genome-specific) genomic_nucleotide_frequencies.txt report
@@ -598,8 +598,8 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
* `bamqc` - specifies the `bamqc` sub-program of `qualimap`
* `-bam` - specifies the input bam file
-* `-gff` - specifices the feature file contining regions of interest for the reference genome (can be gff, gtf, or bed format)
-* `-outdir` - specifices the path to print the alignment QC output files to
+* `-gff` - specifies the feature file containing regions of interest for the reference genome (can be gff, gtf, or bed format)
+* `-outdir` - specifies the path to print the alignment QC output files to
* `--collect-overlap-pairs` - instructs the program to output statistics of overlapping paired-end reads (if data were paired-end, no effect if single-end)
* `--java-mem-size=6G` - specifies the amount of memory to use (here this is set to 6G; see [qualimap FAQ here](http://qualimap.conesalab.org/doc_html/faq.html?highlight=java-mem-size))
* `-nt` - specifies the number of threads to use
@@ -607,7 +607,7 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
**Input data:**
* sample-1_bismark_{bt2,hisat2}_sorted.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above)
-* a feature file contining regions of interest for the reference genome in gtf format (this pipeline version uses the Ensembl fasta file indicated in the `gtf` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
+* a feature file containing regions of interest for the reference genome in gtf format (this pipeline version uses the Ensembl fasta file indicated in the `gtf` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
> **NOTE**
> If using RNA, files will include "bismark_hisat2" instead of "bismark_bt2" in the name.
@@ -726,7 +726,7 @@ bismark_methylation_extractor --parallel NumberOfThreads \
* `--cytosine_report` - instructions the program to produce a genome-wide methylation report for all cytosines in the genome
* `--genome_folder` - a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
* `--ignore_r2` - specifies how many bases to ignore from the 5' end of the reverse reads (bismark docs recommend 2, see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#options))
- > Note: The first couple of bases in the reverse read of bisulfite sequence experiments show a severe bias towards non-methylation as a result of end-reparing sonicated fragments with unmentulated cytosines, so it is recommened to remove the first couple basepairs
+ > Note: The first couple of bases in the reverse read of bisulfite sequence experiments show a severe bias towards non-methylation as a result of end-repairing sonicated fragments with unmethylated cytosines, so it is recommended to remove the first few basepairs
* `--ignore_3prime_r2` - specifies how many bases to ignore from the 3' end of the reverse reads to remove unwanted biases from the end of reads (For specific recommnendations see Bismark documentation on [Library Types](https://felixkrueger.github.io/Bismark/bismark/library_types/))
* sample-1_bismark_bt2_sorted.bam - the input bam file, provided as a positional argument
@@ -1081,7 +1081,7 @@ tiles_myDiff.all_sig <- getMethylDiff(tiles_diff, difference = 25, qvalue = 0.01
# Get significantly hypermethylated tiles
tiles_myDiff.hyper <- getMethylDiff(tiles_diff, difference = 25, qvalue = 0.01, type = "hyper")
-# Get signifcantly hypomethylated tiles
+# Get significantly hypomethylated tiles
tiles_myDiff.hypo <- getMethylDiff(tiles_diff, difference = 25, qvalue = 0.01, type = "hypo")
```
From 8247a33ea17f0291a23ac400b3762709bbde0e94 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Thu, 7 Nov 2024 14:50:16 -0800
Subject: [PATCH 13/13] Update GL-DPPD-7113.md
Updated broken documentation link
---
Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
index a86a2593..1280191e 100644
--- a/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
+++ b/Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md
@@ -741,7 +741,7 @@ bismark_methylation_extractor --parallel NumberOfThreads \
**Output data:**
* **\*\_context\_\*.txt.gz** (bismark methylation-call files for CpG, CHG, and CHH contexts that were detected; see [bismark documentation](https://felixkrueger.github.io/Bismark/), namely [methylation call](http://felixkrueger.github.io/Bismark/bismark/alignment/#methylation-call) for symbols, and [methylation extraction output](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#the-methylation-extractor-output-looks-like-this-tab-separated) for file format)
-* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
+* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-output))
* **\*.bismark.cov.gz** (gzip-compressed bedGraph-formatted file like above "\*.bedGraph.gz", but also including 2 more columns of methylated and unmethylated counts at the specified position; see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-specific-options))
* **\*.M-bias.txt** (text file with methylation information in the context of the position in reads, helpful for investigating bias as a function of base position in the read; see [bismark documentation](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#m-bias-plot))
* **\*_splitting_report.txt** (text file containing general methylation detection information)