Merge pull request #126 from nasa/DEV_NF_MAAgilent_1ch
Microarray Agilent 1-channel workflow updates from v 1.0.3 to 1.0.4
asaravia-butler authored Oct 22, 2024
2 parents 90d6bb5 + 8383bb5 commit 4b3c7b3
Showing 12 changed files with 138 additions and 24 deletions.
Original file line number Diff line number Diff line change
@@ -5,6 +5,20 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.4](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_1.0.4/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch) - 2024-10-02

### Added

- Add automatic generation of processed data protocol ([#85](https://github.com/nasa/GeneLab_Data_Processing/issues/85))

### Changed

- Small bug fixes in `Agile1CMP.qmd`
- Check if `getBM()` returned results before concatenating it to dataframe to avoid error in `bind_rows()` ([#96](https://github.com/nasa/GeneLab_Data_Processing/issues/96))
- When renaming column names, specify which columns to rename to avoid unintentional renaming ([#97](https://github.com/nasa/GeneLab_Data_Processing/issues/97))
- When renaming factor names, prevent cases where a factor is partially renamed because it contains a substring that is another factor ([#100](https://github.com/nasa/GeneLab_Data_Processing/issues/100))
- Update software table generation to exclude `R.utils` from table if data files are not compressed ([#99](https://github.com/nasa/GeneLab_Data_Processing/issues/99))

## [1.0.3](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_1.0.3/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch) - 2024-05-17

### Changed
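The 1.0.4 entry for issue #100 above concerns factor names that contain another factor name as a substring. A minimal Python sketch of the idea (not the workflow's R code), assuming plain substring replacement and hypothetical factor names: applying the longest sanitized names first keeps a short name from rewriting part of a longer one.

```python
def restore_group_names(colname, mapping):
    """Replace sanitized group names in `colname` with their original names.

    `mapping` is an iterable of (safe_name, original_name) pairs.
    Longest safe names are applied first so that a short name cannot
    rewrite part of a longer name that merely contains it.
    """
    ordered = sorted(set(mapping), key=lambda pair: -len(pair[0]))
    for safe_name, original_name in ordered:
        colname = colname.replace(safe_name, original_name)
    return colname


# Hypothetical factor names: "X1G" is a substring of "X1G.Rotated".
mapping = [("X1G", "1G"), ("X1G.Rotated", "1G Rotated")]
renamed = restore_group_names("Log2fc_(X1G.Rotated)v(X1G)", mapping)
```

With shortest-first ordering the same input would leave `1G.Rotated` half-renamed, because `X1G.Rotated` no longer appears once `X1G` has been replaced.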
Original file line number Diff line number Diff line change
@@ -93,9 +93,9 @@ We recommend installing Singularity on a system wide level as per the associated
All files required for utilizing the NF_MAAgilent1ch GeneLab workflow for processing Agilent 1 Channel Microarray data are in the [workflow_code](workflow_code) directory. To get a copy of the latest NF_MAAgilent1ch version onto your system, download the code as a zip file from the release page and unzip it by running the following commands:
```bash
-wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MAAgilent1ch_1.0.3/NF_MAAgilent1ch_1.0.3.zip
+wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MAAgilent1ch_1.0.4/NF_MAAgilent1ch_1.0.4.zip
-unzip NF_MAAgilent1ch_1.0.3.zip
+unzip NF_MAAgilent1ch_1.0.4.zip
```
<br>
@@ -104,15 +104,15 @@ unzip NF_MAAgilent1ch_1.0.3.zip
### 3. Run the Workflow
-While in the location containing the `NF_MAAgilent1ch_1.0.3` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAgilent1ch workflow:
+While in the location containing the `NF_MAAgilent1ch_1.0.4` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAgilent1ch workflow:
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
<br>
#### 3a. Approach 1: Run the workflow on a GeneLab Agilent 1 Channel Microarray dataset
```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf \
-profile singularity \
--osdAccession OSD-548 \
--gldsAccession GLDS-548
@@ -125,7 +125,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
> Note: Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf \
-profile singularity \
--runsheetPath </path/to/runsheet>
```
@@ -134,7 +134,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
**Required Parameters For All Approaches:**
-* `NF_MAAgilent1ch_1.0.3/main.nf` - Instructs Nextflow to run the NF_MAAgilent1ch workflow
+* `NF_MAAgilent1ch_1.0.4/main.nf` - Instructs Nextflow to run the NF_MAAgilent1ch workflow
* `-profile` - Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
@@ -166,7 +166,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
All parameters listed above and additional optional arguments for the NF_MAAgilent1ch workflow, including debug-related options that may not be immediately useful for most users, can be viewed by running the following command:
```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf --help
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf --help
```
See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.
@@ -180,7 +180,7 @@ See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nex
All R code steps and output are rendered within a Quarto document yielding the following:
- Output:
-- NF_MAAgilent1ch_1.0.3.html (html report containing executed code and output including QA plots)
+- NF_MAAgilent1ch_1.0.4.html (html report containing executed code and output including QA plots)
The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "Agilent 1 Channel Processing"
-subtitle: "Workflow Version: NF_MAAgilent1ch_1.0.3"
+subtitle: "Workflow Version: NF_MAAgilent1ch_1.0.4"
date: now
title-block-banner: true
format:
@@ -530,7 +530,10 @@ if (organism %in% c("athaliana")) {
values = probe_id_chunk,
mart = ensembl)
-df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
+if (nrow(chunk_results) > 0) {
+df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
+}
Sys.sleep(10) # Slight break between requests to prevent back-to-back requests
}
}
@@ -712,7 +715,7 @@ reformat_names <- function(colname, group_name_mapping) {
stringr::str_replace(pattern = ".condition", replacement = "v")
# remap to group names before make.names was applied
-unique_group_name_mapping <- unique(group_name_mapping)
+unique_group_name_mapping <- unique(group_name_mapping) %>% arrange(-nchar(safe_name))
for ( i in seq(nrow(unique_group_name_mapping)) ) {
safe_name <- unique_group_name_mapping[i,]$safe_name
original_name <- unique_group_name_mapping[i,]$original_name
@@ -722,7 +725,7 @@ reformat_names <- function(colname, group_name_mapping) {
return(new_colname)
}
-df_interim <- df_interim %>% dplyr::rename_with( reformat_names, group_name_mapping = design_data$mapping )
+df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition|^Genes\\.'), group_name_mapping = design_data$mapping)
# Concatenate expression values for each sample
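The `.cols = matches(...)` change above limits renaming to columns whose names match a pattern, so unrelated columns are never passed to the rename function. A rough Python analogue, as a sketch only: the helper name and the sample column names are hypothetical, and the case-insensitive flag mirrors the default behavior of tidyselect's `matches()`.

```python
import re

# Same pattern as the `.cols` selector in the hunk above.
RENAME_PATTERN = re.compile(r"\.condition|^Genes\.", re.IGNORECASE)


def rename_matching(columns, rename, pattern=RENAME_PATTERN):
    """Apply `rename` only to column names that match `pattern`."""
    return [rename(col) if pattern.search(col) else col for col in columns]


# Hypothetical column names; only the last two match the pattern.
columns = ["ENSEMBL", "X1G_probe", "Genes.X1G", "X1G.conditionvFLT"]
result = rename_matching(columns, lambda col: col.replace("X1G", "1G"))
```

Here `X1G_probe` contains the factor name but is left alone, which is the unintentional renaming the selector is meant to prevent.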
Original file line number Diff line number Diff line change
@@ -75,7 +75,9 @@ Staging:
Sample name is used as a unique sample identifier during processing
Example: Atha_Col-0_Root_WT_Ctrl_45min_Rep1_GSM502538

-- ISA Field Name: Label
+- ISA Field Name:
+- Label
+- Parameter Value[label]
ISA Table Source: Sample
Runsheet Column Name: Label
Processing Usage: >-
Original file line number Diff line number Diff line change
@@ -97,13 +97,11 @@ workflow {
ch_software_versions = Channel.value(nf_version)
AGILE1CH.out.versions | map{ it -> it.text } | mix(ch_software_versions) | set{ch_software_versions}
VV_AGILE1CH.out.versions | map{ it -> it.text } | mix(ch_software_versions) | set{ch_software_versions}
-ch_software_versions | unique
-| collectFile(
-newLine: true,
-sort: true,
-cache: false
-)
-| GENERATE_SOFTWARE_TABLE

+GENERATE_SOFTWARE_TABLE(
+ch_software_versions | unique | collectFile(newLine: true, sort: true, cache: false),
+ch_runsheet | splitCsv(header: true, quote: '"') | first | map{ row -> row['Array Data File Name'] }
+)

emit:
meta = ch_meta
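The second argument to `GENERATE_SOFTWARE_TABLE` above takes the `Array Data File Name` value from the first data row of the runsheet. A minimal Python sketch of that extraction, assuming a comma-separated runsheet with a header row; the function name and sample rows are hypothetical.

```python
import csv
import io


def first_row_value(runsheet_text, column):
    """Return the value of `column` from the first data row of a CSV runsheet."""
    reader = csv.DictReader(io.StringIO(runsheet_text), quotechar='"')
    first_row = next(reader)  # raises StopIteration if there are no data rows
    return first_row[column]


# Hypothetical two-sample runsheet.
runsheet = (
    "Sample Name,Array Data File Name,organism\n"
    'S1,"S1_raw.txt.gz",Mus musculus\n'
    "S2,S2_raw.txt.gz,Mus musculus\n"
)
filename = first_row_value(runsheet, "Array Data File Name")
```

Only the first row is inspected, so this approach implicitly assumes that all samples in a dataset share the same compression state.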
Original file line number Diff line number Diff line change
@@ -5,12 +5,13 @@ process GENERATE_SOFTWARE_TABLE {

input:
path("software_versions.yaml")
val(filename)

output:
path("software_versions_GLmicroarray.md")

script:
"""
-SoftwareYamlToMarkdownTable.py software_versions.yaml
+SoftwareYamlToMarkdownTable.py software_versions.yaml \"$filename\"
"""
}
Original file line number Diff line number Diff line change
@@ -41,14 +41,19 @@

@click.command()
@click.argument("input_yaml", type=click.Path(exists=True))
-def yamlToMarkdown(input_yaml: Path):
+@click.argument("filename")
+def yamlToMarkdown(input_yaml: Path, filename: str):
""" Using a software versions """
with open(input_yaml, "r") as f:
data = yaml.safe_load(f)

data.extend(ASSUMED_SOFTWARE)
df = pd.DataFrame(data)

# If data files are not compressed, won't use R.utils to unzip them during processing
if not filename.endswith('.gz'):
AGILENT_SOFTWARE_DPPD.remove('r.utils')

# Filter to direct software used (i.e. exclude dependencies of the software)
df = df.loc[df["name"].str.lower().isin(AGILENT_SOFTWARE_DPPD)]

Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
process GENERATE_PROTOCOL {
tag "${ params.gldsAccession }"
publishDir "${ params.outputDir }/${ params.gldsAccession }/GeneLab",
mode: params.publish_dir_mode

input:
path("software_versions_GLmicroarray.md")
val(organism)

output:
path("PROTOCOL_GLmicroarray.txt")

script:
"""
generate_protocol.sh $workflow.manifest.version \"$organism\"
"""
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#!/bin/bash
set -u

software_versions_file="software_versions_GLmicroarray.md"

# Read the markdown table
while read -r line; do
# Extract program and version
program=$(echo "$line" | awk -F'|' '{gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",$1); print $1}')
version=$(echo "$line" | awk -F'|' '{gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",$2); print $2}')

# Skip the header row and rows without version information
if [[ $program != "Program" && $version != "Version" && ! -z $version ]]; then
# Replace invalid characters in program name with underscores
sanitized_program=$(echo "$program" | tr -cd '[:alnum:]_')

# Create environment variable name
env_var_name="${sanitized_program}_VERSION"

# Set the environment variable
export "$env_var_name=$version"
fi
done < <(sed -n '/|/p' "$software_versions_file" | sed 's/^ *|//;s/|$//')

# Print the extracted versions
env | grep "_VERSION"

# Get organism
organism=$2

# List of organisms
organism_list=("Homo sapiens" "Mus musculus" "Rattus norvegicus" "Drosophila melanogaster" "Caenorhabditis elegans" "Danio rerio" "Saccharomyces cerevisiae")

# Check the value of 'organism' variable and set 'GENE_MAPPING_STEP' accordingly
if [[ $organism == "Arabidopsis thaliana" ]]; then
GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probe using the Plants Ensembl database ftp server (plants.ensembl.org, release 54)."
elif [[ " ${organism_list[*]} " == *"${organism//\"/}"* ]]; then
GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probe using biomaRt (version ${biomaRt_VERSION}), Ensembl database (ensembl.org, release 107)."
else
GENE_MAPPING_STEP="TBD"
fi

# Check the value of 'organism' variable and set 'GENE_ANNOTATION_DB' accordingly
if [[ $organism == "Arabidopsis thaliana" ]]; then
GENE_ANNOTATION_DB="org.At.tair.db"
elif [[ $organism == "Homo sapiens" ]]; then
GENE_ANNOTATION_DB="org.Hs.eg.db"
elif [[ $organism == "Mus musculus" ]]; then
GENE_ANNOTATION_DB="org.Mm.eg.db"
elif [[ $organism == "Rattus norvegicus" ]]; then
GENE_ANNOTATION_DB="org.Rn.eg.db"
elif [[ $organism == "Drosophila melanogaster" ]]; then
GENE_ANNOTATION_DB="org.Dm.eg.db"
elif [[ $organism == "Caenorhabditis elegans" ]]; then
GENE_ANNOTATION_DB="org.Ce.eg.db"
elif [[ $organism == "Danio rerio" ]]; then
GENE_ANNOTATION_DB="org.Dr.eg.db"
elif [[ $organism == "Saccharomyces cerevisiae" ]]; then
GENE_ANNOTATION_DB="org.Sc.sgd.db"
else
GENE_ANNOTATION_DB="TBD"
fi

# Fill in the protocol template
template="Data were processed as described in GL-DPPD-7112 ([https://github.com/nasa/GeneLab_Data_Processing/blob/master/Microarray/Agilent_1-channel/Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md]), using NF_MAAgilent1ch version $1 ([https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_$1/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch]). In short, a RunSheet containing raw data file location and processing metadata from the study's *ISA.zip file was generated using dp_tools (version ${dp_tools_VERSION}). The raw array data files were loaded into R (version ${R_VERSION}) using limma (version ${limma_VERSION}). Raw data quality assurance density, pseudo image, MA, and foreground-background plots were generated using limma (version ${limma_VERSION}), and boxplots were generated using ggplot2 (version ${ggplot2_VERSION}). The raw intensity data was background corrected and normalized across arrays via the limma (version ${limma_VERSION}) quantile method. Normalized data quality assurance density, pseudo image, and MA plots were generated using limma (version ${limma_VERSION}), and boxplots were generated using ggplot2 (version ${ggplot2_VERSION}). ${GENE_MAPPING_STEP} Differential expression analysis was performed in R (version ${R_VERSION}) using limma (version ${limma_VERSION}); all groups were compared pairwise for each probe to generate a moderated t-statistic and associated p- and adjusted p-value. Gene annotations were assigned using the custom annotation tables generated in-house as detailed in GL-DPPD-7110 ([https://github.com/nasa/GeneLab_Data_Processing/blob/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md]), with STRINGdb (version 2.8.4), PANTHER.db (version 1.0.11), and ${GENE_ANNOTATION_DB} (version 3.15.0)."

# Output the filled template
echo "$template" > PROTOCOL_GLmicroarray.txt
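The script above reads the first two columns of the Markdown software table and substitutes the versions into the protocol text. The parsing step can be sketched in Python as follows. This is an illustration rather than a replacement for the script: the function name, the third column header, and the version numbers in the sample table are all hypothetical, and the separator row is skipped explicitly here.

```python
def parse_versions(markdown):
    """Map each program in a Markdown table to its version string.

    Assumes the first two columns are the program name and its version.
    """
    versions = {}
    for line in markdown.splitlines():
        if "|" not in line:
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        program, version = cells[0], cells[1]
        # Skip the header row and rows without a version.
        if program == "Program" or version == "Version" or not version:
            continue
        # Skip the alignment row (e.g. ":---").
        if set(version) <= set("-:"):
            continue
        versions[program] = version
    return versions


# Hypothetical table; the version numbers are placeholders.
table = """\
|Program|Version|Relevant Links|
|:------|:------|:-------------|
|R|1.2.3|https://example.org/r|
|limma|4.5.6|https://example.org/limma|
"""
versions = parse_versions(table)
sentence = f"Raw data were loaded into R (version {versions['R']})."
```

A dictionary lookup fails loudly with a `KeyError` when a program is missing from the table, which makes an incomplete software table easy to spot.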
Original file line number Diff line number Diff line change
@@ -45,7 +45,7 @@ manifest {
mainScript = 'main.nf'
defaultBranch = 'main'
nextflowVersion = '>=23.10.1'
-version = '1.0.3'
+version = '1.0.4'
}

def trace_timestamp = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss')
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ c_reset = "\033[0m";

include { GENERATE_MD5SUMS } from './modules/GENERATE_MD5SUMS.nf'
include { UPDATE_ISA_TABLES } from './modules/UPDATE_ISA_TABLES.nf'
include { GENERATE_PROTOCOL } from './modules/POST_PROCESSING/GENERATE_PROTOCOL'

/**************************************************
* HELP MENU **************************************
@@ -49,6 +50,7 @@ workflow {
main:
ch_processed_directory = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }", checkIfExists: true)
ch_runsheet = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }/Metadata/*_runsheet.csv", checkIfExists: true)
ch_software_versions = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }/GeneLab/software_versions_GLmicroarray.md", checkIfExists: true)
GENERATE_MD5SUMS(
ch_processed_directory,
ch_runsheet,
@@ -59,4 +61,8 @@ workflow {
ch_runsheet,
"${ projectDir }/bin/dp_tools__agilent_1_channel" // dp_tools plugin
)
GENERATE_PROTOCOL(
ch_software_versions,
ch_runsheet | splitCsv(header: true, quote: '"') | first | map{ row -> row['organism'] }
)
}
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@

|Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version|
|:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7112.md](../Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md)|[NF_MAAgilent1ch_1.0.3](NF_MAAgilent1ch)|23.10.1|
+|*[GL-DPPD-7112.md](../Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md)|[NF_MAAgilent1ch_1.0.4](NF_MAAgilent1ch)|23.10.1|

*Current GeneLab Pipeline/Workflow Implementation

