[Dorado] New Dorado Basecalling Workflow Terra #659

fraser-combe · 2024-10-24T02:22:08Z

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR adds a new Dorado Basecalling Workflow, a GPU-accelerated pipeline for basecalling Oxford Nanopore POD5 files. The workflow includes optional automatic model selection, SAM-to-BAM conversion, and demultiplexing into per-barcode FASTQ files. Outputs are uploaded to a new user-defined Terra table for downstream analysis.

⚡ Impacted Workflows/Tasks

This is a new workflow that does not impact any other workflows

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

This PR introduces the following changes:

New Workflow: Dorado Basecalling Workflow version 1.0.
Optional Inputs: Added use_auto_model flag for automatic model selection.
Manual and Auto Model Options: Supports both predefined models and automatic selection (sup, hac, fast).
SAM-to-BAM Conversion: Integrated SAMTools task for efficient data handling.
Demultiplexing: Added demux step to create barcode-specific FASTQ files.
Terra Integration: Outputs transferred to Terra, with a table generated for downstream workflows.

⚙️ Algorithm

New Tasks:
1. Dorado Basecall: Converts POD5 files to SAM using GPU acceleration. Uses a new Dorado StaPH-B Docker image, v0.8.0:
  https://github.com/StaPH-B/docker-builds/tree/master/dorado/0.8.0
2. SAMTools Convert: Converts SAM files to BAM.
3. Dorado Demultiplexing: Creates barcode-specific FASTQ files.
4. File Transfer: Uploads FASTQ files to Terra.
5. Terra Table Creation: Generates Terra table from the uploaded FASTQ files.
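
As a rough illustration of how the first three tasks chain together, the sketch below builds the corresponding command lines as Python argument lists. The function names, the file names, and the choice to pass `--kit-name` at the demux step are assumptions made for illustration; the actual commands live in the WDL tasks under tasks/basecalling/ and may differ.

```python
from typing import List


def basecall_cmd(model: str, pod5_dir: str) -> List[str]:
    """Dorado basecalling. SAM is written to stdout, so the caller redirects it to a file."""
    return ["dorado", "basecaller", model, pod5_dir, "--emit-sam"]


def sam_to_bam_cmd(sam_path: str, bam_path: str) -> List[str]:
    """samtools view with -b writes BAM; -o names the output file."""
    return ["samtools", "view", "-b", "-o", bam_path, sam_path]


def demux_cmd(bam_path: str, kit_name: str, out_dir: str) -> List[str]:
    """Dorado demultiplexing into per-barcode FASTQ files (assumes classification happens here)."""
    return [
        "dorado", "demux",
        "--kit-name", kit_name,
        "--output-dir", out_dir,
        "--emit-fastq",
        bam_path,
    ]


if __name__ == "__main__":
    # Hypothetical inputs; the commands are printed, not executed.
    for cmd in (
        basecall_cmd("sup", "pod5_pass/"),
        sam_to_bam_cmd("calls.sam", "calls.bam"),
        demux_cmd("calls.bam", "SQK-NBD114-24", "demux_out/"),
    ):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps quoting out of the picture and makes each step easy to inspect before it is run.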

➡️ Inputs

New Inputs:
- use_auto_model (Boolean): Enables automatic model selection.
- model_accuracy (String): Specifies model accuracy if using auto-selection (sup, hac, fast).
- fastq_file_name (String): Prefix for output FASTQ files.
- fastq_upload_path (String): Path to Terra for uploading FASTQ files.
- kit_name (String): Specifies sequencing kit for adapter/barcode trimming.
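
The interaction between these inputs can be sketched as a small validation function. This is only an illustration of the rules implied by the list above: `resolve_model` is a hypothetical helper, `dorado_model` is the manual model input mentioned in the reviewer scenarios, and the workflow's real checks may be stricter or looser.

```python
from typing import Optional

VALID_ACCURACIES = ("sup", "hac", "fast")


def resolve_model(use_auto_model: bool,
                  model_accuracy: Optional[str] = None,
                  dorado_model: Optional[str] = None,
                  kit_name: Optional[str] = None) -> str:
    """Return the model argument to hand to Dorado, or raise on invalid inputs."""
    # Assumption: kit_name is mandatory because demultiplexing depends on it.
    if not kit_name:
        raise ValueError("kit_name is required")
    if use_auto_model:
        # Auto selection only needs an accuracy tier.
        if model_accuracy not in VALID_ACCURACIES:
            raise ValueError(
                f"model_accuracy must be one of {VALID_ACCURACIES}, got {model_accuracy!r}"
            )
        return model_accuracy
    # Manual selection needs an explicit model name or path.
    if not dorado_model:
        raise ValueError("dorado_model is required when use_auto_model is false")
    return dorado_model
```

For example, `resolve_model(True, model_accuracy="sup", kit_name="SQK-NBD114-24")` returns `"sup"`, while omitting `kit_name` raises a `ValueError`.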

⬅️ Outputs

New Outputs:
- basecalled_fastqs: Array of FASTQ files generated from basecalling.
- demuxed_fastqs: Array of FASTQ files generated from demultiplexing.
- logs: Logs generated during the demux step.
- terra_table_tsv: TSV file for uploading to Terra.
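
To make the `terra_table_tsv` output more concrete, here is a minimal sketch of how such a file could be assembled from the uploaded FASTQ paths. Terra load files identify the table through an `entity:<table>_id` header in the first column. The `read1` column name, the helper names, and the rule of deriving the sample ID from the file name are assumptions for illustration, not the workflow's confirmed behavior.

```python
import csv
import io
from typing import Iterable

FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def sample_id_from_path(path: str) -> str:
    """Use the file name without its FASTQ suffix as the row ID."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    for suffix in FASTQ_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def build_terra_table(table_name: str, fastq_paths: Iterable[str]) -> str:
    """Return TSV text with one row per FASTQ, sorted by sample ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # The first column header tells Terra which table the rows belong to.
    writer.writerow([f"entity:{table_name}_id", "read1"])
    for path in sorted(fastq_paths, key=sample_id_from_path):
        writer.writerow([sample_id_from_path(path), path])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical bucket paths.
    print(build_terra_table("dorado_demux", [
        "gs://example-bucket/run1/barcode02.fastq.gz",
        "gs://example-bucket/run1/barcode01.fastq.gz",
    ]))
```

Sorting by sample ID keeps the rows in barcode order regardless of the order in which the files were uploaded.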

🧪 Testing

Tested locally with simulated POD5 inputs and GPU resources.
Tested in Terra with multiple POD5 files; the expected barcodes were produced.

Test 1. 9 rabies POD5 files from 2 barcodes (manual model)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/889322c2-19f0-4092-ac7f-4863e676b28a

Test 2. 24 POD5 files from 2 barcodes (manual model)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/9bef28ea-82ba-4406-8545-f32de7e07e02

Test 3. 24 files from 2 barcodes (auto mode)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/cead789e-c737-4541-a6ed-d9b907493ee1

[Image: example output Terra table]

Edge Case Handling: Verified workflow behavior with missing inputs and unsupported models.
Terra Integration: Confirmed successful transfer and Terra table generation with sample data.

Suggested Scenarios for Reviewer to Test

Basecalling with Auto Model Selection: Run with the use_auto_model flag enabled.
Manual Model Input: Test with a specific dorado_model path and confirm outputs.
Demultiplexing: Verify barcode-specific FASTQ outputs.
Edge Case: Provide incomplete inputs (e.g., missing kit_name) to confirm error handling.
Terra Table Generation: Confirm Terra table creation and FASTQ uploads with valid inputs.

🔬 Final Developer Checklist

The workflow/task has been tested and results, including file contents, are as anticipated
The CI/CD has been adjusted and tests are passing (Theiagen developers)
Code changes follow the style guide
Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

All changed results have been confirmed
You have tested the PR appropriately (see the testing guide for more information)
All code adheres to the style guide
MD5 sums have been updated
The PR author has addressed all comments
The documentation has been updated

fraser-combe · 2024-11-15T15:31:35Z

Outputs are working and the documentation has been updated; see https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/79417a5e-da8c-4fdc-aa61-f28de8490bba

tasks/basecalling/task_dorado_basecall.wdl

…e used at runtime; improved logging of dorado STDERR to a file; parsed explicit model name from STDERR file or accept user input string; added dorado_log task output file
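
The "parse the model name from STDERR or accept the user input string" logic described in this commit could look roughly like the sketch below. The exact wording of Dorado's log lines is not shown in this PR, so the regular expression simply looks for any token shaped like a full model name ending in `_fast`, `_hac`, or `_sup` plus a version. `model_used` and the sample log line are hypothetical.

```python
import re

# Matches tokens such as "<prefix>_sup@v5.0.0"; an assumption about how model names appear.
MODEL_NAME_RE = re.compile(r"\b\w[\w.]*_(?:fast|hac|sup)@v\d+(?:\.\d+)*")
SHORTHANDS = {"fast", "hac", "sup"}


def model_used(user_model: str, dorado_stderr: str) -> str:
    """Report the explicit model name when the user gave only an accuracy shorthand."""
    if user_model not in SHORTHANDS:
        # The user already supplied an explicit model string; report it unchanged.
        return user_model
    match = MODEL_NAME_RE.search(dorado_stderr)
    # Fall back to the shorthand if no model-like token is found in the log.
    return match.group(0) if match else user_model


if __name__ == "__main__":
    # Invented log line for illustration only.
    log_text = "[info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0\n"
    print(model_used("sup", log_text))
```

Falling back to the user's string means the output is never empty, even if the log format changes between Dorado releases.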

kapsakcj · 2024-11-18T17:43:16Z

I will test 3 different workflows and report back:

using fast as dorado_model input string

wf here: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/c44f45c8-401f-4643-8492-dedcab07d143
~~need to review outputs and logs~~ ⚠️
✅ everything looks good. Correct dorado_model string was output, along with analysis date. FASTQ files copied to correct bucket location and Terra table updated accordingly (although the table was overwritten by the below workflow, as I expected)

using [email protected]

wf here: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/2028a37c-feaa-45b3-9ebf-b405c6ab3468
successful, ~~but need to review outputs and logs~~ ⚠️
✅ everything looks good here as well!

using sup (as this will be the recommended input param for our users)

wf here: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/a9708f03-9732-483b-9c60-e786473defdd
~~need to review outputs and logs~~ ⚠️
✅ Everything looks good here as well. I'm assembling 4 samples with TheiaProk_ONT to confirm output FASTQs are valid and able to be processed. I expect it to pass like last time

EDIT: all of these wfs were run AFTER making the below commit 82a7962 bug fix

tasks/basecalling/task_dorado_demux.wdl

kapsakcj · 2024-11-20T18:03:15Z

TheiaProk_ONT ran successfully on the FASTQs produced by my test above with SUP dorado model 👍 https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/238d0f1f-fe13-4823-8846-b0774fb75e0c

More confirmation the FASTQs produced by this wf are valid for downstream processing

fraser-combe added 30 commits September 30, 2024 15:30

Add Dorado basecalling workflow and update dockstore.yml

a31ee2e

updated logic to deal with arrays

3168761

setting up workflow and task

cc8d9b6

add workflow dorado new

77e97f7

add as dorado_basecall to import

c382193

task path update dorado basecaller

635cb7d

updating naming conventions

9d0e9c4

updating naming conventions

cc60b05

update logic naming funcs to not overlap

63f3eca

update call namings basecall task

ceb7262

update output calls

2d9de34

update gpuType

718795e

add maxretries

1d53a2d

remove sample names and use file name update output prefix

5f87c15

doc update dorado workflow

9d22c74

sort output by barcodexx in filename

a5e102b

remove sample names as input

039bffb

update output folder structure

f2e4586

update output folder structure

86ecc1c

update output file format glob structure

84e6dda

update output file format glob structure with version 1.0 back in

f08b755

trying non glob output

7f11c45

back to glob output folders

ebdee28

update workflow remove ouput prefix and run array through dorado

a20bb58

output folders update

717ccd8

output folders update

5ac06af

version add

1b72f6a

output folders update

518ae52

output folders update adjust output base

02c2cc0

output folders update adjust output base

a0486b1

fraser-combe and others added 21 commits November 14, 2024 09:54

forgot to call task versioning

8c0e983

trying select first on dorado model

66300d5

rearrange output

787f230

update model type string logic basecalling

0a6a172

removed extra dorado model basecall

37fecd6

model_used

a2b8555

model_used

86dbcb1

model_used

f0eb3c4

array string output model name

3f33619

test dorado model handling string

527d706

trying []

c042a66

use demux for dorado_model_used output

5b2bb2a

update log output

978a15e

updating model output logic basecalling

a833f97

try outputting version and model name again

41c41a8

Merge branch 'main' into fc-dorado-workflow-standalone-dev

c94f59d

updating demux

e3df8f6

update model output

b4dd0a7

set dorado_model_used to single string in wf

5d3b91f

samtools version and dorado outputs

1bc7bb0

update output docs

eedfd54

kapsakcj reviewed Nov 16, 2024

tasks/basecalling/task_dorado_basecall.wdl (outdated, resolved)

fraser-combe and others added 2 commits November 18, 2024 10:32

update docker image transfer task

dec6ef8

changes to dorado basecall task: added logic for selecting model to be used at runtime; improved logging of dorado STDERR to a file; parsed explicit model name from STDERR file or accept user input string; added dorado_log task output file

3a6488b

fix if statement syntax for parsing dorado log

82a7962

kapsakcj reviewed Nov 20, 2024

tasks/basecalling/task_dorado_demux.wdl (outdated, resolved)

minor update log file name demux task

fae4807
