-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Module/minimap2/1.0 #262
base: master
Are you sure you want to change the base?
Module/minimap2/1.0 #262
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hayashaalan !! Great!! I think my biggest confusion is whether (or not) we need to include the utils here - not clear if you run it with or without that module. I assumed without since it is commented. Maybe I am not right?
# TODO: Update the list of available wildcards, if applicable | ||
inputs: | ||
# Available wildcards: {seq_type} {genome_build} {sample_id} | ||
sample_fastq: "__UPDATE__" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any indexes required for this tool to work? Maybe we should add them here as inputs as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also minimap can work with paired end data - can you think of a way you might be able to adapt this so that it can use paired or unpaired fastq files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No index files are required to run this tool
samtools: "{MODSDIR}/envs/samtools-1.9.yaml" | ||
|
||
threads: | ||
minimap2: 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wonder how long this tool takes to run and if we should consider giving it more resources here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it more resources (10) and now it takes ~24 hours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kostia's point is fair - it makes sense to set the default here MUCH higher. I think you should make 24 the default.
input: | ||
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam", | ||
bai = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam.bai", | ||
sorted_bam = str(rules._minimap2_symlink_bam.input.bam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually the output of the rule is requested as input of the following rule, not the input like in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to delete the unsorted bam once the sorted bam is created by the utils module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear - it's not the sorted bam that's being deleted but the un-sorted bam, right?
Duplicate marking isn't necessary for PromethION, but I think it's still an important step for short-read WGS to include in this module. Can you add some rules to complete duplicate marking for short reads? You can use wildcard constraints and an input function to determine which output to symlink depending on the seq_type. (Also, this could be used for capture data in theory, right?)
# TODO: Update the list of available wildcards, if applicable | ||
inputs: | ||
# Available wildcards: {seq_type} {genome_build} {sample_id} | ||
sample_fastq: "__UPDATE__" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also minimap can work with paired end data - can you think of a way you might be able to adapt this so that it can use paired or unpaired fastq files?
inputs: | ||
# Available wildcards: {seq_type} {genome_build} {sample_id} | ||
sample_fastq: "__UPDATE__" | ||
reference_build: "__UPDATE__" | ||
sample_fastq: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs some documentation about the {number}
wildcard and how to correctly specify paired and unpaired fastq files.
samtools: "{MODSDIR}/envs/samtools-1.9.yaml" | ||
|
||
threads: | ||
minimap2: 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kostia's point is fair - it makes sense to set the default here MUCH higher. I think you should make 24 the default.
|
||
threads: | ||
minimap2: 4 | ||
samtools: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for samtools - make the default higher please!
samtools: 1 | ||
|
||
resources: | ||
minimap2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sufficient memory allocation for a PromethION whole genome?
@@ -0,0 +1 @@ | |||
/projects/rmorin/projects/gambl-repos/gambl-hshaalan/src/lcr-modules/envs/minimap2/minimap2-2.24.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update the symlink to point to the lcr-modules/envs/minimap2
directory to parallel the samtools symlink below?
sam = str(rules._minimap2_run.output.sam) | ||
output: | ||
bam = CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam", | ||
complete = touch(CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam.complete") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kostia's point is that writing the .complete
file alone isn't enough. For it to have the desired effect (forcing this rule to re-run if it's interrupted/the bam file is incomplete) it has to be an input to a subsequent rule. Can you add the complete
file from this rule as an input to the next rule in the module?
input: | ||
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam", | ||
bai = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam.bai", | ||
sorted_bam = str(rules._minimap2_symlink_bam.input.bam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear - it's not the sorted bam that's being deleted but the un-sorted bam, right?
Duplicate marking isn't necessary for PromethION, but I think it's still an important step for short-read WGS to include in this module. Can you add some rules to complete duplicate marking for short reads? You can use wildcard constraints and an input function to determine which output to symlink depending on the seq_type. (Also, this could be used for capture data in theory, right?)
@hayashaalan just checking if you have any updates for this? |
@hayashaalan do you have any updates for this module? |
Pull Request Checklists
Important: When opening a pull request, keep only the applicable checklist and delete all other sections.
Checklist for New Module
Required
I used the cookiecutter template and updated the placeholder rules.
The snakemake rules follow the design guidelines.
rules
object (e.g. for input files) are wrapped withstr()
.Every rule in the module is either listed under
localrules
or has thethreads
andresources
directives.Input and output files are being symlinked into the
CFG["inputs"]
andCFG["outputs"]
subdirectories, respectively.I grouped the input symlinking rule to the next job that uses the input files.
I updated the final target rule (
*_all
) to include every output rule.I explained important module design decisions in
CHANGELOG.md
.I tested the module on real data for all supported
seq_type
values.I updated the
default.yaml
configuration file to provide default values for each rule in the module snakefile.I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.
I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).
I replaced every value that should (or might need to) be updated in the default configuration file with
__UPDATE__
.I recursively searched for all comments containing
TODO
to ensure none were left. For example:If applicable
I added more granular output subdirectories.
I added rules to the
reference_files
workflow to generate any new reference files.I added subdirectories with large intermediate files to the list of
scratch_subdirectories
in thedefault.yaml
configuration file.I updated the list of available wildcards for the input files in the
default.yaml
configuration file.Checklist for Updated Module
Important! If you are updating the module version, ensure the previous version of the module is restored from master.
If you want to restore a deleted file or directory from the remote master, you can use
git checkout origin/master path/to/file
,then a
git commit
will ensure that file is tracked on your branch again.Example: