outputDir ties running RNA-seq to local file system - support for cloud file systems #74

Open
TomGardner opened this issue Sep 17, 2020 · 5 comments

@TomGardner

I'm trying to get this pipeline running on Google Cloud Platform (GCP) using https://github.com/broadinstitute/wdl-runner
Unfortunately, outputDir is used pervasively throughout the WDL files, and that seems to be a showstopper.
I've modified some code to ignore Rna3PairedEnd.yml, and that worked.
But now the issue is with the output files and their location.
I'm getting errors like:
Required file output '/cromwell_root/./samples/rna3-paired-end/rna3-paired-end.markdup.bam.md5' does not exist.
On GCP, that local path would not exist.
This issue is a request for an enhancement to support cloud file systems.

@DavyCats
Contributor

DavyCats commented Sep 21, 2020

I'm not too familiar with the Google Cloud infrastructure or wdl-runner. Does the . reference to the current directory not exist on Google Cloud (even inside docker containers)?

The file in question is actually an optional output, which is not generated by default, so it's a bit odd it's saying a "Required" file output is missing. Perhaps this is an issue with wdl-runner's localisation (maybe it doesn't support optional outputs).

FYI, we have tested this pipeline on AWS, where it ran (mostly) without issue. The only problem we had there was a bug in Cromwell which, to my understanding, has since been fixed.

@TomGardner
Author

Hello @DavyCats - It turns out there are some utilities that deal with the paths quite nicely, so the error I pointed out above was actually solved by setting 'Boolean createMd5File = true' (instead of false) wherever I found it defined.
That created the .md5 file that was missing.
However, I'm having a similar problem with multiqc:
2020/09/22 01:00:41 Delocalizing output /cromwell_root/./multiqc_data.zip -> gs://cromwell-test-runs/gatk/RNA-seq_v4.0.0/work/RNAseq/e2f9dce5-258d-4abb-b454-0cc6cf69c63a/call-multiqcTask/multiqc_data.zip
Seems like a similar issue, where the pipeline is expecting this output, but it is not there.

@DavyCats
Contributor

Ah right, in that case setting RNAseq.multiqcTask.dataDir to true will likely resolve the multiqc issue as well.

@TomGardner
Author

Hello @DavyCats - I pretty much have it working now, except for variant calling. If you'd like, I can send my changes to you.

A summary of the changes:

  • I updated all entries in the options file to point to GCP - similar to: "RNAseq.referenceFasta": "gs://bucket-name/gatk/RNA-seq_v4.0.0/input/data/reference/reference.fasta"
  • I updated the options file with these:
  "RNAseq.variantCalling": false,
  "RNAseq.lncRNAdetection": false,
  "RNAseq.multiqcTask.dataDir": true,
  "RNAseq.multiqcTask.zipDataDir": true,
  "RNAseq.sampleJobs.markDuplicates.createMd5File": true,
  "RNAseq.preprocessing.gatherBamFiles.createMd5File": true
  • I had to hard-wire local references to R1.fq.gz and R2.fq.gz in QC.wdl:
        File read1 = "gs://bucket-name/gatk/RNA-seq_v4.0.0/input/data/rna3/R1.fq.gz"
        File? read2 = "gs://bucket-name/gatk/RNA-seq_v4.0.0/input/data/rna3/R2.fq.gz"
  • The variant calling fails with the following error - GCP is trying to copy a local filename to a GCP file name.
    Not sure what to do with this:
2020/09/24 00:18:27 Localizing input gs://bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/scatter-0.bed -> /cromwell_root/cromwell-test-runs/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/scatter-0.bed
Error attempting to localize file with command: 'mkdir -p '/cromwell_root/bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/' && rm -f /root/.config/gcloud/gce && gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=1' cp 'gs://bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/scatter-0.bed' '/cromwell_root/bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/''
CommandException: No URLs matched: gs://bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/scatter-0.bed
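
For reference, the boolean overrides listed in this comment could be merged into an existing JSON inputs file (called the options file above) with a small script instead of by hand. This is only a sketch under assumptions: the helper names `apply_overrides` and `write_gcp_inputs` are hypothetical and not part of the pipeline, and the keys and values are copied from the list above.

```python
import json

# Overrides copied from the comment above; adjust them for your own run.
GCP_OVERRIDES = {
    "RNAseq.variantCalling": False,
    "RNAseq.lncRNAdetection": False,
    "RNAseq.multiqcTask.dataDir": True,
    "RNAseq.multiqcTask.zipDataDir": True,
    "RNAseq.sampleJobs.markDuplicates.createMd5File": True,
    "RNAseq.preprocessing.gatherBamFiles.createMd5File": True,
}


def apply_overrides(inputs: dict, overrides: dict) -> dict:
    """Return a copy of `inputs` with every key in `overrides` replaced or added."""
    merged = dict(inputs)
    merged.update(overrides)
    return merged


def write_gcp_inputs(source_path: str, target_path: str) -> None:
    """Read a JSON inputs file, apply GCP_OVERRIDES, and write the result.

    Both paths are supplied by the caller; this helper is hypothetical.
    """
    with open(source_path) as handle:
        inputs = json.load(handle)
    with open(target_path, "w") as handle:
        json.dump(apply_overrides(inputs, GCP_OVERRIDES), handle, indent=2)
```

Calling `write_gcp_inputs` with the path of the original inputs file and a new output path leaves the original untouched, which makes it easy to diff the two files when a run fails.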

@DavyCats
Contributor

Thanks for sharing!

> I had to hard-wire local references to R1.fq.gz and R2.fq.gz in QC.wdl:

Did editing the samplesheet to contain these paths not work?

> The variant calling fails with the following error - GCP is trying to copy a local filename to a GCP file name.
> Not sure what to do with this:

I'm not sure what's going wrong here either. Does the file gs://bucket-name/gatk/RNA-seq_v4.0.0/work-2020-09-23-3/RNAseq/a4065b11-c4f5-4cb0-b455-ac29a32e2c0b/call-scatterList/scatters/scatter-0.bed exist in the bucket? If not, it sounds like the scatterList job failed, so there may be another error somewhere.
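
One way to answer the question of whether the scatter file exists is to ask gsutil directly. The sketch below wraps `gsutil -q stat`, which exits with status 0 when the object exists; the function name `gcs_object_exists` is hypothetical, and it assumes gsutil is installed and authenticated for the bucket.

```python
import subprocess


def gcs_object_exists(url: str, run=subprocess.run) -> bool:
    """Return True if `gsutil -q stat <url>` exits with status 0.

    `run` is injectable so the logic can be exercised without gsutil.
    A non-zero exit can also mean an authentication or network problem,
    so a False result is a hint to investigate, not proof of absence.
    """
    result = run(["gsutil", "-q", "stat", url])
    return result.returncode == 0
```

If this returns False for the scatter-0.bed path from the log, the logs of the call-scatterList step are the next place to look.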
