-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Module/gatk rnaseq/1.0 #184
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jasper! I left some comments. I have another general question: there is {pair_status}
wildcard, but no reference to normal samples in the module. RNASeq would be unpaired, so do we need this wildcard to be included?
demo/config.yaml
Outdated
gatk_rnaseq: | ||
inputs: | ||
sample_bam: "data/{sample_id}.bam" | ||
sample_bai: "data/{sample_id}.bam.bai" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs new line added at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still shows no new line here, maybe I am looking at outdated file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I must have missed it. It's fixed
window: 35 # window size between SNPs in cluster | ||
cluster_size: 3 # at least 3 SNPs in cluster | ||
# hard filtering (filters OUT) based on metrics: | ||
# FS (FisherStrand): Phred-scale probability that there is a strand bias from a Fisher's test. (default FS > 30.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for detailed documentation!
bcftools: "{MODSDIR}/envs/bcftools-1.10.2.yaml" | ||
|
||
threads: | ||
gatk_splitntrim: 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to combine these keys and reuse them across rules. In other words, if several rules need the same number of threads, they can refer to the same key in config. Same can be applied to resources as well. I think this reduces the number of keys to specify/adjust if needed, and reduces complexity of the config. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible, but wouldn't it also cause confusion if there's ever a need to change the numbers? Also, would there be a unique name that could be applied to a subset of the rules ("thread_12" would be non-descriptive and wouldn't indicate which rules would use this parameter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, let's keep the more detailed and informative names
@@ -0,0 +1,36 @@ | |||
name: test-bcftools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and another environment file) should be rather in lcr-modules/envs/ folder, and symlinked in the module. This will make it easier to reuse same environments across modules, therefore reducing the need to build multiple environments. There is already an environment for bcftools with the same version, so you can just as well symlink lcr-modules/envs/bcftools/bcftools-1.10.2.yaml
to this module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -0,0 +1,352 @@ | |||
name: bioinformatics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this environment just GATK or there is something else besides it that is required for this module to run? There is a GATK environment lcr-modules/envs/gatk/gatk-4.1.8.1.yaml
, maybe it can be reused here? I see this one contains tools like, for example, bedtools, vcf2maf, tmux, and a lot of perl dependencies, but are they used in the module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm that is a good question. I pulled it from a working environment. I'll test to see if it'll work with the gatk module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it worked with gatk module. I will just use the symlinked version for both gatk and bcftools
stdout = CFG["logs"]["gatk_splitntrim"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.gatk_splitntrim.stdout.log", | ||
stderr = CFG["logs"]["gatk_splitntrim"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.gatk_splitntrim.stderr.log" | ||
params: | ||
mem_mb = lambda wildcards, resources: int(resources.mem_mb / 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For tools running java it is sometimes useful to include heat buffer, otherwise they might jump to excessive memory usage and will get killed by slurm. I mean the same approach as you did for the rule that merges vcf files. Also, here you can give -Xmx memory in m
, so this calculation may not be necessary. Same can be applied to the rules below as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed all -Xmx {params.mem_mb}G to -Xmx {params.mem_mb}m with same calculation (0.8 * assigned mem_mb)
**CFG["resources"]["gatk_base_recalibration"] | ||
shell: | ||
op.as_one_line(""" | ||
gatk --java-options "-Xmx{params.mem_mb}G" BaseRecalibrator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you know whether any rule is using java garbage collector, which may by default use all available threads? If so, it may create a lot of competitive usage when launched for many samples simultaneously. One way to get around this is to provide -XX:ConcGCThreads=1
option that basically limits java to only 1 thread for garbage collector. Please ignore this comment if there is no issue with garbage collector!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I'm not sure about any rules specifically. I do see GATK documentation about some functions sometimes using too many resources. I will add -XX:ConcGCThreads=1
as a pre-emptive measure
|
||
rule _gatk_base_recalibration: | ||
input: | ||
bam = rules._gatk_splitntrim.output, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please wrap the inputs referring to the outputs of other rules with str()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bai = rules._gatk_applybqsr.output.bai, | ||
fasta = reference_files("genomes/{genome_build}/genome_fasta/genome.fa") | ||
output: | ||
vcf = temp(CFG["dirs"]["gatk_variant_calling"] + "{seq_type}--{genome_build}--{pair_status}/{sample_id}.{chrom}.vcf.gz"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does HaplotypeCaller and VariantFiltration compress vcf files by default, or do you need to do that as a separate step? I was having impression that majority of variant calling tools do not compress by default, but may be very wrong!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the output files are gzipped (tested with: file results/gatk_rnaseq-1.0/06-gatk_variant_filtration/mrna--grch37--no_normal/Toledo.filtered.vcf.gz
): output: gzip compressed data, extra field
bam = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam", | ||
fasta = reference_files("genomes/{genome_build}/genome_fasta/genome.fa") | ||
output: | ||
bam = temp(CFG["dirs"]["gatk_splitntrim"] + "bam/{seq_type}--{genome_build}--{pair_status}/{sample_id}.split_reassign_mq.bam") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about making it seq_type}--{genome_build}/{sample_id}--{pair_status}
? This would make it consistent eith the directory structure of other modules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer making the final file name simple like: Toledo.ouput.filt.vcf.gz.
You're right about no-normal pair status. Pair status just comes out as "no_normal" in the output directories. I'm not sure if it'd be informative or not to keep it in, but I can remove it altogether too
Removed pair-status as a wildcard |
… fixed a final gnomad-filtration step.
@Jwong684 is this ready to go? |
This should be completed - I have some output tested on the BL/DLBCL cell lines in GAMBL from way before. I want to add a grouping resource so it doesn't get clogged up with temp bams in the middle of the pipeline |
Great, thanks! It shows some merge conflicts with the recent update - can you resolve them and we'll get this on master? |
eb9875d
to
f422933
Compare
@Jwong684 what should we do with this? Is this still in works? Should we close this PR? Any plans on continuing this? |
It is functional but not that great. The final vcf becomes way too big for this to be implemented on a large GAMBL-wide scale. There are just too many SNVs that are called with RNA-seq. I don't have a particular interest in variants in RNAseq at the moment, but it might be useful as a verification tool. It does capture many of the same SNPs as you would see in WES in the same sample. |
Pull Request Checklists
Important: When opening a pull request, keep only the applicable checklist and delete all other sections.
Checklist for New Module
Required
I used the cookiecutter template and updated the placeholder rules.
The snakemake rules follow the design guidelines.
rules
object (e.g. for input files) are wrapped withstr()
.Every rule in the module is either listed under
localrules
or has thethreads
andresources
directives.Input and output files are being symlinked into the
CFG["inputs"]
andCFG["outputs"]
subdirectories, respectively.I updated the final target rule (
*_all
) to include every output rule.I explained important module design decisions in
CHANGELOG.md
.I tested the module on real data for all supported
seq_type
values.I updated the
default.yaml
configuration file to provide default values for each rule in the module snakefile.I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.
I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).
I replaced every value that should (or might need to) be updated in the default configuration file with
__UPDATE__
.I recursively searched for all comments containing
TODO
to ensure none were left. For example:If applicable
I added more granular output subdirectories.
I added rules to the
reference_files
workflow to generate any new reference files.I added subdirectories with large intermediate files to the list of
scratch_subdirectories
in thedefault.yaml
configuration file.I updated the list of available wildcards for the input files in the
default.yaml
configuration file.Checklist for Updated Module
To be completed.