Multi-threading approach #55

Open
Teklu67 opened this issue Mar 28, 2022 · 8 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@Teklu67

Teklu67 commented Mar 28, 2022

Hi,
This is a very useful program, but it is taking a long time to subsample a large fastq file. I am running it on a server and would like to use multi-threading, but I am a novice at programming and not sure how to do that. Any help, please?
Thanks,

@mbhall88
Owner

Hi @Teklu67. When you say "a long time", how long are we talking? And how large is your file?

@Teklu67
Author

Teklu67 commented Mar 29, 2022

Thanks so much for the quick response. It finished sampling 30x from a fq of 690 Gb (60x coverage) in 2 days. Because I have the resources to run several threads, I thought it would finish much faster if there were an option for multi-threading. Thanks!

@mbhall88
Owner

Wow, that's a very big fastq file! Is it compressed (e.g., gzip)?

How did you install rasusa?

@Teklu67
Author

Teklu67 commented Apr 1, 2022

Yes, it is compressed in .gz format. The data is from tetraploid wheat. I installed rasusa through conda.

@mbhall88
Owner

mbhall88 commented Apr 2, 2022

Is your data Illumina?

Sorry, there's not really much I can offer in the way of speeding rasusa up.

At some point I will look into whether multi-threading the IO is possible (i.e. batching reads).

I'll leave this open and add it to my list of things to investigate in the coming months. Sorry I can't do it sooner, but I have a lot of other research projects I am trying to juggle.

However, if you (or anyone else) would like to have a go at it, I would be very happy to receive a pull request.

mbhall88 added the enhancement and help wanted labels on Apr 2, 2022
@Teklu67
Author

Teklu67 commented Apr 5, 2022

It is ONT data. That is OK, thank you for your time.

@mbhall88
Owner

mbhall88 commented Apr 6, 2022

In the meantime, I would suggest splitting the file into subsets and then randomly subsampling each subset.
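
For anyone wanting to try the splitting step, here is a minimal Python sketch under stated assumptions. It is not part of rasusa: the function name `split_fastq_gz` and the output file naming are hypothetical, and it assumes every FASTQ record is exactly four lines (no wrapped sequences). Records are dealt out round-robin so each subset gets a roughly equal share of the reads.

```python
import gzip
from itertools import islice


def split_fastq_gz(in_path, out_prefix, n_parts):
    """Distribute FASTQ records round-robin across n_parts gzip files.

    Hypothetical helper, not a rasusa feature. Assumes each record is
    exactly four lines. Returns the list of output paths.
    """
    out_paths = [f"{out_prefix}.part{i}.fastq.gz" for i in range(n_parts)]
    outputs = [gzip.open(path, "wt") for path in out_paths]
    try:
        with gzip.open(in_path, "rt") as reads:
            index = 0
            while True:
                # Take the next four lines as one record.
                record = list(islice(reads, 4))
                if not record:
                    break
                outputs[index % n_parts].write("".join(record))
                index += 1
    finally:
        for handle in outputs:
            handle.close()
    return out_paths
```

Each part could then be subsampled in its own process, with each one targeting roughly 1/n of the desired coverage (for example, 30x split over 4 parts is about 7.5x per part), and the outputs concatenated afterwards. Note that the split itself is still a single-threaded pass over the whole file, so it only pays off if the subsampling step is the slow part.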

@mbhall88
Owner

Another suggestion: I suspect most of the runtime is spent (de)compressing the data. Switching from gzip to zstd should drastically reduce the time spent on decompression.
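
One rough way to check that suspicion before recompressing anything is to time a plain read through the gzip stream on a sample of the data. The sketch below uses only Python's standard `gzip` module; the function name `time_gzip_read` is hypothetical, and Python's decoder is not the one rasusa uses, so treat the number as an order-of-magnitude guide only.

```python
import gzip
import time


def time_gzip_read(path, chunk_size=1 << 20):
    """Stream a gzip file and return (decompressed_bytes, seconds).

    Hypothetical helper: it does no parsing or subsampling, so the
    elapsed time approximates the cost of decompression alone.
    """
    total = 0
    start = time.perf_counter()
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

If the time per gigabyte measured this way, scaled up to the full file, accounts for most of the observed runtime, then decompression is likely the bottleneck and a faster format is worth trying.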
