Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module/minimap2/1.0 #262

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions modules/minimap2/1.0/config/default.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
lcr-modules:

minimap2:

# TODO: Update the list of available wildcards, if applicable
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
inputs:
# Available wildcards: {seq_type} {genome_build} {sample_id}
sample_fastq: "__UPDATE__"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any indexes required for this tool to work? Maybe we should add them here as inputs as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also minimap can work with paired end data - can you think of a way you might be able to adapt this so that it can use paired or unpaired fastq files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No index files are required to run this tool

reference_build: "__UPDATE__"
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved

scratch_subdirectories: []

options:
minimap2: '-ax map-ont -L --MD -Y -R "@RG\tID:{sample_id}\tLB:{sample_id}\tPL:ONT\tSM:{sample_id}"'
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
samtools: '-bhS'

conda_envs:
minimap2: "{MODSDIR}/envs/minimap2-2.24.yaml"
samtools: "{MODSDIR}/envs/samtools-1.9.yaml"

threads:
minimap2: 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wonder how long this tool takes to run and if we should consider giving it more resources here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it more resources (10) and now it takes ~24 hours

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kostia's point is fair - it makes sense to set the default here MUCH higher. I think you should make 24 the default.

samtools: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for samtools - make the default higher please!


resources:
minimap2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sufficient memory allocation for a PromethION whole genome?

mem_mb: 30000
samtools:
mem_mb: 10000

pairing_config:
promethION:
run_paired_tumours: False
run_unpaired_tumours_with: "no_normal"
run_paired_tumours_as_unpaired: True

16 changes: 16 additions & 0 deletions modules/minimap2/1.0/envs/minimap2-2.24.yaml
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
name: minimap2
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- k8=0.2.5
- libgcc-ng=12.2.0
- libgomp=12.2.0
- libstdcxx-ng=12.2.0
- libzlib=1.2.13
- minimap2=2.24
- zlib=1.2.13
prefix: /home/hshaalan/miniconda3/envs/minimap2
1 change: 1 addition & 0 deletions modules/minimap2/1.0/envs/samtools-1.9.yaml
168 changes: 168 additions & 0 deletions modules/minimap2/1.0/minimap2.smk
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
#!/usr/bin/env snakemake


##### ATTRIBUTION #####


# Original Author: Haya Shaalan
# Module Author: Haya Shaalan
# Contributors: N/A


##### SETUP #####

# Import package with useful functions for developing analysis modules
import oncopipe as op

# Check that the oncopipe dependency is up-to-date. Add all the following lines to any module that uses new features in oncopipe
min_oncopipe_version="1.0.11"
import pkg_resources
try:
from packaging import version
except ModuleNotFoundError:
sys.exit("The packaging module dependency is missing. Please install it ('pip install packaging') and ensure you are using the most up-to-date oncopipe version")

# To avoid this we need to add the "packaging" module as a dependency for LCR-modules or oncopipe

current_version = pkg_resources.get_distribution("oncopipe").version
if version.parse(current_version) < version.parse(min_oncopipe_version):
print('\x1b[0;31;40m' + f'ERROR: oncopipe version installed: {current_version}' + '\x1b[0m')
print('\x1b[0;31;40m' + f"ERROR: This module requires oncopipe version >= {min_oncopipe_version}. Please update oncopipe in your environment" + '\x1b[0m')
sys.exit("Instructions for updating to the current version of oncopipe are available at https://lcr-modules.readthedocs.io/en/latest/ (use option 2)")

# End of dependency checking section

# Setup module and store module-specific configuration in `CFG`
# `CFG` is a shortcut to `config["lcr-modules"]["minimap2"]`
CFG = op.setup_module(
name = "minimap2",
version = "1.0",
# TODO: If applicable, add more granular output subdirectories
subdirectories = ["inputs", "minimap2", "sort_bam", "outputs"],
)

#include: "../../utils/2.1/utils.smk"
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved

# Define rules to be run locally when using a compute cluster
localrules:
_minimap2_input_fastq,
_minimap2_symlink_bam,
_minimap2_output_bam,
_minimap2_all,

sample_ids_minimap2 = list(CFG['samples']['sample_id'])

##### RULES #####


# Symlinks the input files into the module results directory (under '00-inputs/')
rule _minimap2_input_fastq:
input:
fastq = CFG["inputs"]["sample_fastq"]
output:
fastq = CFG["dirs"]["inputs"] + "fastq/{seq_type}/{sample_id}.fastq.gz"
run:
op.absolute_symlink(input.fastq, output.fastq)


rule _minimap2_run:
input:
fastq = str(rules._minimap2_input_fastq.output.fastq),
fasta = reference_files("genomes/{genome_build}/genome_fasta/genome.fa")
output:
sam = pipe(CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.sam")
lkhilton marked this conversation as resolved.
Show resolved Hide resolved
log:
stdout = CFG["logs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}/minimap2.stdout.log",
stderr = CFG["logs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}/minimap2.stderr.log"
params:
opts = CFG["options"]["minimap2"]
conda:
CFG["conda_envs"]["minimap2"]
threads:
CFG["threads"]["minimap2"]
resources:
**CFG["resources"]["minimap2"]
shell:
op.as_one_line("""
minimap2 {params.opts}
-t {threads}
{input.fasta}
{input.fastq}
-o {output.sam}
> {log.stdout}
2> {log.stderr}
""")


rule _minimap2_samtools:
input:
sam = str(rules._minimap2_run.output.sam)
output:
bam = CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam",
complete = touch(CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam.complete")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this works here because this output is not used in the shell command or elsewhere in the module. Maybe this is left from the earlier versions and needs to be dropped?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snakemake uses the touch flag to touch an empty file without having to write a shell command for it https://snakemake.readthedocs.io/en/master/snakefiles/rules.html#flag-files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kostia's point is that writing the .complete file alone isn't enough. For it to have the desired effect (forcing this rule to re-run if it's interrupted/the bam file is incomplete) it has to be an input to a subsequent rule. Can you add the complete file from this rule as an input to the next rule in the module?

log:
stderr = CFG["logs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}/samtools.stderr.log"
params:
opts = CFG["options"]["samtools"]
conda:
CFG["conda_envs"]["samtools"]
threads:
CFG["threads"]["samtools"]
resources:
**CFG["resources"]["samtools"]
shell:
op.as_one_line("""
samtools view {params.opts}
{input.sam} > {output.bam}
2> {log.stderr}
""")


rule _minimap2_symlink_bam:
lkhilton marked this conversation as resolved.
Show resolved Hide resolved
input:
bam = str(rules._minimap2_samtools.output.bam)
output:
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.bam"
wildcard_constraints:
sample_id = "|".join(sample_ids_minimap2)
run:
op.absolute_symlink(input.bam, output.bam)


# Symlinks the final output files into the module results directory (under '99-outputs/')
rule _minimap2_output_bam:
input:
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam",
lkhilton marked this conversation as resolved.
Show resolved Hide resolved
bai = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam.bai",
sorted_bam = str(rules._minimap2_symlink_bam.input.bam)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually the output of the rule is requested as input of the following rule, not the input like in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to delete the unsorted bam once the sorted bam is created by the utils module

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear - it's not the sorted bam that's being deleted but the un-sorted bam, right?

Duplicate marking isn't necessary for PromethION, but I think it's still an important step for short-read WGS to include in this module. Can you add some rules to complete duplicate marking for short reads? You can use wildcard constraints and an input function to determine which output to symlink depending on the seq_type. (Also, this could be used for capture data in theory, right?)

output:
bam = CFG["dirs"]["outputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam"
wildcard_constraints:
sample_id = "|".join(sample_ids_minimap2)
run:
op.relative_symlink(input.bam, output.bam, in_module=True)
op.relative_symlink(input.bai, output.bam + ".bai", in_module=True)
os.remove(input.sorted_bam)
shell("touch {input.sorted_bam}.deleted")



# Generates the target sentinels for each run, which generate the symlinks
rule _minimap2_all:
input:
expand(
[
str(rules._minimap2_output_bam.output.bam),
],
zip, # Run expand() with zip(), not product()
seq_type=CFG["samples"]["seq_type"],
genome_build=CFG["inputs"]["reference_build"],
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
sample_id=CFG["samples"]["sample_id"])


##### CLEANUP #####


# Perform some clean-up tasks, including storing the module-specific
# configuration on disk and deleting the `CFG` variable
op.cleanup_module(CFG)
1 change: 1 addition & 0 deletions modules/minimap2/1.0/schemas/base-1.0.yaml
14 changes: 14 additions & 0 deletions modules/minimap2/CHANGELOG.md
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Changelog

All notable changes to the `minimap2` module will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0] - 2023-03-14

This release was authored by Haya Shaalan.

<!-- TODO: Explain each important module design decision below. -->

- No module design decisions explained here yet.
2 changes: 1 addition & 1 deletion modules/utils/2.1/config/default.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ lcr-modules:
## Lines commented out with `#?` can optionally be user-configured
## Lines commented out with `##` act as regular comments

paired_modules: ["bwa_mem", "star"]
paired_modules: ["bwa_mem", "star", minimap2]
hayashaalan marked this conversation as resolved.
Show resolved Hide resolved
## See main README.md for how to set `samples` in the Snakefile
#! samples: null

Expand Down