Input Streaming #578

Open
minw2828 opened this issue Nov 12, 2024 · 3 comments

Comments

@minw2828
Member

You probably already know this; I'm just putting a note here as something to consider in the future.

DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.
Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.

ref

@alexiswl
Member

Thanks Min, I'm aware of the streaming capabilities of DRAGEN.

See https://github.com/umccr/cwl-ica/blob/main/schemas/fastq-list-row/1.0.0/fastq-list-row__1.0.0.yaml#L26-L42, which allows the read_1 and read_2 attributes to be strings (such as presigned URLs) that are then inserted into the fastq list CSV.
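As a rough illustration of what that looks like, here is a minimal Python sketch that writes fastq list rows into CSV text, with read_1 and read_2 given as URL strings rather than local paths. The column names (RGID, RGSM, RGLB, Lane, Read1File, Read2File) are my understanding of the DRAGEN fastq list layout, and the `rgid`, `rgsm`, `rglb`, and `lane` keys, the `build_fastq_list_csv` helper, and the example URLs are all hypothetical. It is not the code the cwl-ica tool actually runs.

```python
import csv
import io

# Assumed DRAGEN fastq list header; check against the DRAGEN docs before relying on it.
FASTQ_LIST_COLUMNS = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]


def build_fastq_list_csv(rows):
    """Return fastq list CSV text for a list of row dicts.

    Each row dict uses hypothetical keys: rgid, rgsm, rglb, lane, read_1, read_2.
    read_1 / read_2 are plain strings, so they can be local paths or presigned URLs.
    """
    buffer = io.StringIO()
    # csv.writer quotes any field containing a comma, which keeps odd URLs safe.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FASTQ_LIST_COLUMNS)
    for row in rows:
        writer.writerow(
            [
                row["rgid"],
                row["rgsm"],
                row["rglb"],
                row["lane"],
                row["read_1"],
                row["read_2"],
            ]
        )
    return buffer.getvalue()


# Placeholder values only; the URLs below are not real presigned URLs.
example_rows = [
    {
        "rgid": "RG1",
        "rgsm": "SAMPLE_A",
        "rglb": "LIB_A",
        "lane": 1,
        "read_1": "https://example-bucket.s3.amazonaws.com/sample_R1.fastq.gz?X-Amz-Signature=placeholder",
        "read_2": "https://example-bucket.s3.amazonaws.com/sample_R2.fastq.gz?X-Amz-Signature=placeholder",
    }
]

fastq_list_csv = build_fastq_list_csv(example_rows)
print(fastq_list_csv)
```

Whether DRAGEN then streams from those URLs or the platform downloads them first is decided outside this file.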

Unfortunately, streaming capabilities are dependent on both the workflow engine and the orchestration engine.

While cwltool provides the 'streamable' option, it's up to the orchestration engine to implement it.

ICAv2 downloads all the inputs into a local 'scratch' space first and then streams the inputs from there.

See the following blockers regarding streaming data on ICAv2 that I've raised:

Furthermore, we have run some tests with presigned URLs and streaming, and it's actually not that efficient (if at all). The local FSx instance downloads data pretty quickly and has fast I/O, so downloading first is potentially faster than streaming.

@minw2828
Member Author

Thanks Alexis. I have learnt something.

Is the orchestration engine from ICAv2 rather than cwltool? I don't have access rights to the umccr-illumina/ica_v2 repo; I get a 404.

@alexiswl
Member

Ah okay, I will arrange to fix the 404 error.

Yes, it is cwltool, but they have altered it to be DRAGEN compatible, and it runs tasks through a Kubernetes wrapper rather than through Docker.

It also runs a bunch of non-cwl pre-steps to configure the analysis runtime.

In my experience, streaming is no faster unless you're only taking a chunk of the file.
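To make the "chunk of the file" case concrete, here is a minimal Python sketch of reading just a byte range from a URL with an HTTP `Range` header, using only the standard library. The helper names `build_range_request` and `fetch_range` are hypothetical, and I'm assuming the server behind the URL (for example S3 via a presigned GET URL) honours range requests. This is not how DRAGEN or ICAv2 streams internally.

```python
import urllib.request


def build_range_request(url, start, length):
    """Build a GET request for `length` bytes beginning at byte offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url, start, length, timeout=30):
    """Fetch only the requested byte range instead of the whole file."""
    request = build_range_request(url, start, length)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honoured the Range header;
        # a 200 would mean it is sending the full object instead.
        if response.status != 206:
            raise RuntimeError(f"server ignored Range header (status {response.status})")
        return response.read()
```

For example, `fetch_range(url, 0, 65536)` would pull only the first 64 KiB, which is where streaming can beat downloading the whole file to scratch first.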
