Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module/gatk rnaseq/1.0 #184

Open
wants to merge 18 commits into
base: master
Choose a base branch
from
Open

Module/gatk rnaseq/1.0 #184

wants to merge 18 commits into from

Conversation

Jwong684
Copy link
Member

@Jwong684 Jwong684 commented May 7, 2021

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for New Module

Required

  • I used the cookiecutter template and updated the placeholder rules.

  • The snakemake rules follow the design guidelines.

    • All references to the rules object (e.g. for input files) are wrapped with str().
  • Every rule in the module is either listed under localrules or has the threads and resources directives.

  • Input and output files are being symlinked into the CFG["inputs"] and CFG["outputs"] subdirectories, respectively.

  • I updated the final target rule (*_all) to include every output rule.

  • I explained important module design decisions in CHANGELOG.md.

  • I tested the module on real data for all supported seq_type values.

  • I updated the default.yaml configuration file to provide default values for each rule in the module snakefile.

  • I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.

  • I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).

  • I replaced every value that should (or might need to) be updated in the default configuration file with __UPDATE__.

  • I recursively searched for all comments containing TODO to ensure none were left. For example:

    grep -r TODO modules/<module_name>/1.0

If applicable

  • I added more granular output subdirectories.

  • I added rules to the reference_files workflow to generate any new reference files.

  • I added subdirectories with large intermediate files to the list of scratch_subdirectories in the default.yaml configuration file.

  • I updated the list of available wildcards for the input files in the default.yaml configuration file.

Checklist for Updated Module

To be completed.

@Jwong684 Jwong684 requested review from rdmorin and Kdreval May 7, 2021 22:29
Copy link
Collaborator

@Kdreval Kdreval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jasper! I left some comments. I have another general question: there is {pair_status} wildcard, but no reference to normal samples in the module. RNASeq would be unpaired, so do we need this wildcard to be included?

demo/config.yaml Outdated
gatk_rnaseq:
inputs:
sample_bam: "data/{sample_id}.bam"
sample_bai: "data/{sample_id}.bam.bai"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs new line added at the end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still shows no new line here, maybe I am looking at outdated file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I must have missed it. It's fixed

window: 35 # window size between SNPs in cluster
cluster_size: 3 # at least 3 SNPs in cluster
# hard filtering (filters OUT) based on metrics:
# FS (FisherStrand): Phred-scale probability that there is a strand bias from a Fisher's test. (default FS > 30.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for detailed documentation!

bcftools: "{MODSDIR}/envs/bcftools-1.10.2.yaml"

threads:
gatk_splitntrim: 12
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to combine these keys and reuse them across rules. In other words, if several rules need the same number of threads, they can refer to the same key in config. Same can be applied to resources as well. I think this reduces the number of keys to specify/adjust if needed, and reduces complexity of the config. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, but wouldn't it also cause confusion if there's ever a need to change the numbers? Also, would there be a unique name that could be applied to a subset of the rules ("thread_12" would be non-descriptive and wouldn't indicate which rules would use this parameter)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, let's keep the more detailed and informative names

@@ -0,0 +1,36 @@
name: test-bcftools
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and another environment file) should be rather in lcr-modules/envs/ folder, and symlinked in the module. This will make it easier to reuse same environments across modules, therefore reducing the need to build multiple environments. There is already an environment for bcftools with the same version, so you can just as well symlink lcr-modules/envs/bcftools/bcftools-1.10.2.yaml to this module

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1,352 @@
name: bioinformatics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this environment just GATK or there is something else besides it that is required for this module to run? There is a GATK environment lcr-modules/envs/gatk/gatk-4.1.8.1.yaml, maybe it can be reused here? I see this one contains tools like, for example, bedtools, vcf2maf, tmux, and a lot of perl dependencies, but are they used in the module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that is a good question. I pulled it from a working environment. I'll test to see if it'll work with the gatk module

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it worked with gatk module. I will just use the symlinked version for both gatk and bcftools

stdout = CFG["logs"]["gatk_splitntrim"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.gatk_splitntrim.stdout.log",
stderr = CFG["logs"]["gatk_splitntrim"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.gatk_splitntrim.stderr.log"
params:
mem_mb = lambda wildcards, resources: int(resources.mem_mb / 1000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tools running java it is sometimes useful to include heat buffer, otherwise they might jump to excessive memory usage and will get killed by slurm. I mean the same approach as you did for the rule that merges vcf files. Also, here you can give -Xmx memory in m, so this calculation may not be necessary. Same can be applied to the rules below as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed all -Xmx {params.mem_mb}G to -Xmx {params.mem_mb}m with same calculation (0.8 * assigned mem_mb)

**CFG["resources"]["gatk_base_recalibration"]
shell:
op.as_one_line("""
gatk --java-options "-Xmx{params.mem_mb}G" BaseRecalibrator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you know whether any rule is using java garbage collector, which may by default use all available threads? If so, it may create a lot of competitive usage when launched for many samples simultaneously. One way to get around this is to provide -XX:ConcGCThreads=1 option that basically limits java to only 1 thread for garbage collector. Please ignore this comment if there is no issue with garbage collector!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I'm not sure about any rules specifically. I do see GATK documentation about some functions sometimes using too many resources. I will add -XX:ConcGCThreads=1 as a pre-emptive measure


rule _gatk_base_recalibration:
input:
bam = rules._gatk_splitntrim.output,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please wrap the inputs referring to the outputs of other rules with str()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

bai = rules._gatk_applybqsr.output.bai,
fasta = reference_files("genomes/{genome_build}/genome_fasta/genome.fa")
output:
vcf = temp(CFG["dirs"]["gatk_variant_calling"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.{chrom}.vcf.gz"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does HaplotypeCaller and VariantFiltration compress vcf files by default, or do you need to do that as a separate step? I was having impression that majority of variant calling tools do not compress by default, but may be very wrong!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the output files are gzipped (tested with: file results/gatk_rnaseq-1.0/06-gatk_variant_filtration/mrna--grch37--no_normal/Toledo.filtered.vcf.gz): output: gzip compressed data, extra field

bam = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam",
fasta = reference_files("genomes/{genome_build}/genome_fasta/genome.fa")
output:
bam = temp(CFG["dirs"]["gatk_splitntrim"] + "bam/{seq_type}--{genome_build}--{pair_status}/{sample_id}.split_reassign_mq.bam")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making it seq_type}--{genome_build}/{sample_id}--{pair_status}? This would make it consistent eith the directory structure of other modules

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer making the final file name simple like: Toledo.ouput.filt.vcf.gz.

You're right about no-normal pair status. Pair status just comes out as "no_normal" in the output directories. I'm not sure if it'd be informative or not to keep it in, but I can remove it altogether too

@Jwong684
Copy link
Member Author

Jwong684 commented May 9, 2021

Removed pair-status as a wildcard

Kdreval
Kdreval previously approved these changes May 21, 2021
@Kdreval
Copy link
Collaborator

Kdreval commented Jan 22, 2022

@Jwong684 is this ready to go?

@Jwong684
Copy link
Member Author

This should be completed - I have some output tested on the BL/DLBCL cell lines in GAMBL from way before. I want to add a grouping resource so it doesn't get clogged up with temp bams in the middle of the pipeline

@Kdreval
Copy link
Collaborator

Kdreval commented Jan 22, 2022

Great, thanks! It shows some merge conflicts with the recent update - can you resolve them and we'll get this on master?

@Jwong684 Jwong684 force-pushed the module/gatk_rnaseq/1.0 branch from eb9875d to f422933 Compare February 18, 2022 22:32
@Kdreval
Copy link
Collaborator

Kdreval commented Jul 26, 2024

@Jwong684 what should we do with this? Is this still in works? Should we close this PR? Any plans on continuing this?

@Jwong684
Copy link
Member Author

It is functional but not that great. The final vcf becomes way too big for this to be implemented on a large GAMBL-wide scale. There are just too many SNVs that are called with RNA-seq.

I don't have a particular interest in variants in RNAseq at the moment, but it might be useful as a verification tool. It does capture many of the same SNPs as you would see in WES in the same sample.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants