Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module/minimap2/1.0 #262

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Module/minimap2/1.0 #262

wants to merge 6 commits into from

Conversation

hayashaalan
Copy link
Member

@hayashaalan hayashaalan commented Mar 21, 2023

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for New Module

Required

  • I used the cookiecutter template and updated the placeholder rules.

  • The snakemake rules follow the design guidelines.

    • All references to the rules object (e.g. for input files) are wrapped with str().
  • Every rule in the module is either listed under localrules or has the threads and resources directives.

  • Input and output files are being symlinked into the CFG["inputs"] and CFG["outputs"] subdirectories, respectively.

  • I grouped the input symlinking rule to the next job that uses the input files.

  • I updated the final target rule (*_all) to include every output rule.

  • I explained important module design decisions in CHANGELOG.md.

  • I tested the module on real data for all supported seq_type values.

  • I updated the default.yaml configuration file to provide default values for each rule in the module snakefile.

  • I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.

  • I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).

  • I replaced every value that should (or might need to) be updated in the default configuration file with __UPDATE__.

  • I recursively searched for all comments containing TODO to ensure none were left. For example:

    grep -r TODO modules/<module_name>/1.0

If applicable

  • I added more granular output subdirectories.

  • I added rules to the reference_files workflow to generate any new reference files.

  • I added subdirectories with large intermediate files to the list of scratch_subdirectories in the default.yaml configuration file.

  • I updated the list of available wildcards for the input files in the default.yaml configuration file.

Checklist for Updated Module

Important! If you are updating the module version, ensure the previous version of the module is restored from master.
If you want to restore a deleted file or directory from the remote master, you can use git checkout origin/master path/to/file,
then a git commit will ensure that file is tracked on your branch again.
Example:

mv modules/strelka/1.1 modules/strelka/1.2
git checkout origin/master modules/strelka/1.1

Copy link
Collaborator

@Kdreval Kdreval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hayashaalan !! Great!! I think my biggest confusion is whether (or not) we need to include the utils here - not clear if you run it with or without that module. I assumed without since it is commented. Maybe I am not right?

modules/minimap2/1.0/config/default.yaml Outdated Show resolved Hide resolved
# TODO: Update the list of available wildcards, if applicable
inputs:
# Available wildcards: {seq_type} {genome_build} {sample_id}
sample_fastq: "__UPDATE__"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any indexes required for this tool to work? Maybe we should add them here as inputs as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also minimap can work with paired end data - can you think of a way you might be able to adapt this so that it can use paired or unpaired fastq files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No index files are required to run this tool

samtools: "{MODSDIR}/envs/samtools-1.9.yaml"

threads:
minimap2: 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wonder how long this tool takes to run and if we should consider giving it more resources here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it more resources (10) and now it takes ~24 hours

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kostia's point is fair - it makes sense to set the default here MUCH higher. I think you should make 24 the default.

modules/minimap2/1.0/envs/minimap2-2.24.yaml Show resolved Hide resolved
modules/minimap2/1.0/minimap2.smk Outdated Show resolved Hide resolved
input:
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam",
bai = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam.bai",
sorted_bam = str(rules._minimap2_symlink_bam.input.bam)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually the output of the rule is requested as input of the following rule, not the input like in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to delete the unsorted bam once the sorted bam is created by the utils module

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear - it's not the sorted bam that's being deleted but the un-sorted bam, right?

Duplicate marking isn't necessary for PromethION, but I think it's still an important step for short-read WGS to include in this module. Can you add some rules to complete duplicate marking for short reads? You can use wildcard constraints and an input function to determine which output to symlink depending on the seq_type. (Also, this could be used for capture data in theory, right?)

modules/minimap2/1.0/minimap2.smk Show resolved Hide resolved
modules/minimap2/1.0/minimap2.smk Show resolved Hide resolved
modules/minimap2/CHANGELOG.md Outdated Show resolved Hide resolved
modules/utils/2.1/config/default.yaml Outdated Show resolved Hide resolved
# TODO: Update the list of available wildcards, if applicable
inputs:
# Available wildcards: {seq_type} {genome_build} {sample_id}
sample_fastq: "__UPDATE__"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also minimap can work with paired end data - can you think of a way you might be able to adapt this so that it can use paired or unpaired fastq files?

modules/minimap2/1.0/config/default.yaml Outdated Show resolved Hide resolved
modules/minimap2/1.0/config/default.yaml Outdated Show resolved Hide resolved
modules/minimap2/1.0/minimap2.smk Outdated Show resolved Hide resolved
inputs:
# Available wildcards: {seq_type} {genome_build} {sample_id}
sample_fastq: "__UPDATE__"
reference_build: "__UPDATE__"
sample_fastq:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some documentation about the {number} wildcard and how to correctly specify paired and unpaired fastq files.

samtools: "{MODSDIR}/envs/samtools-1.9.yaml"

threads:
minimap2: 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kostia's point is fair - it makes sense to set the default here MUCH higher. I think you should make 24 the default.


threads:
minimap2: 4
samtools: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for samtools - make the default higher please!

samtools: 1

resources:
minimap2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sufficient memory allocation for a PromethION whole genome?

@@ -0,0 +1 @@
/projects/rmorin/projects/gambl-repos/gambl-hshaalan/src/lcr-modules/envs/minimap2/minimap2-2.24.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the symlink to point to the lcr-modules/envs/minimap2 directory to parallel the samtools symlink below?

sam = str(rules._minimap2_run.output.sam)
output:
bam = CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam",
complete = touch(CFG["dirs"]["minimap2"] + "{seq_type}--{genome_build}/{sample_id}_out.bam.complete")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kostia's point is that writing the .complete file alone isn't enough. For it to have the desired effect (forcing this rule to re-run if it's interrupted/the bam file is incomplete) it has to be an input to a subsequent rule. Can you add the complete file from this rule as an input to the next rule in the module?

input:
bam = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam",
bai = CFG["dirs"]["sort_bam"] + "{seq_type}--{genome_build}/{sample_id}.sort.bam.bai",
sorted_bam = str(rules._minimap2_symlink_bam.input.bam)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear - it's not the sorted bam that's being deleted but the un-sorted bam, right?

Duplicate marking isn't necessary for PromethION, but I think it's still an important step for short-read WGS to include in this module. Can you add some rules to complete duplicate marking for short reads? You can use wildcard constraints and an input function to determine which output to symlink depending on the seq_type. (Also, this could be used for capture data in theory, right?)

@Kdreval
Copy link
Collaborator

Kdreval commented Mar 16, 2024

@hayashaalan just checking if you have any updates for this?

@Kdreval
Copy link
Collaborator

Kdreval commented Jul 26, 2024

@hayashaalan do you have any updates for this module?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants