
Binny step is slow and uses a lot of RAM on large datasets #44

Open
kingtom2016 opened this issue Jan 20, 2023 · 6 comments

@kingtom2016

I am running Binny on 18 metagenomes with an average depth of 20 Gbp, using 20 cores. This step has been running for 12 days and has not completed yet. It also requires a lot of RAM (more than 300 GB).
How can I speed up this step and use less RAM, e.g. by keeping coassembly mode off?

@ohickl
Contributor

ohickl commented Jan 24, 2023

Hi,

Could you post the log (path/to/outdir/logs/binning_binny.log) and your config file so we can have a look at your setup and the run's progress?

Best
Oskar

@kingtom2016
Author

Here are my log and config file contents:

```
06/01/2023 12:56:50 PM - Starting Binny run for sample test.
06/01/2023 01:30:36 PM - Looking for single contig bins.
06/01/2023 01:42:15 PM - Found 0 single contig bins.
06/01/2023 01:42:15 PM - Calculating N90
06/01/2023 01:44:43 PM - N90 is 547, with scMAGs would be 547.
06/01/2023 02:06:23 PM - Masking potentially disruptive sequences from k-mer counting.
06/01/2023 02:09:43 PM - Calculating k-mer frequencies of sizes: 2, 3, 4.
```

```yaml
NX_value: 90
bin_quality:
  min_completeness: 50
  purity: 90
  start_completeness: 92.5
clustering:
  hdbscan_epsilon_range: 0.250,0.000
  hdbscan_min_samples_range: 1,5,10
  include_depth_initial: 'False'
  include_depth_main: 'False'
coassembly_mode: auto
conda_source: ''
db_path: ''
distance_metric: manhattan
embedding:
  max_iterations: 50
extract_scmags: 'True'
kmers: 2,3,4
mantis_env: SemiBin
mask_disruptive_sequences: 'True'
max_cont_length_cutoff: 2250
max_cont_length_cutoff_marker: 2250
max_marker_lineage_depth_lvl: 2
max_n_contigs: 5.0e5
mem:
  big_mem_avail: 100
  big_mem_per_core_gb: 26
  normal_mem_per_core_gb: 16
min_cont_length_cutoff: 2250
min_cont_length_cutoff_marker: 2250
outputdir: tmp/binny_results
prokka_env: SemiBin
raws:
  assembly: ASSEMBLY/final_assembly.fasta
  contig_depth: ''
  metagenomics_alignment: ASSEMBLY/*sort.bam
sample: test
sessionName: TESTRUN_3919871335
snakemake_env: SemiBin
tmp_dir: tmp
write_contig_data: 'True'
```

@ohickl
Contributor

ohickl commented Jan 25, 2023

Thanks.

Something definitely went wrong; calculating the k-mer frequencies should not take that long. Can you check whether there are actually processes running?
How many sequences are in the assembly? You could try sub-sampling it to a small number, e.g. 50k sequences, to see if it runs at all and whether the problem is system-related.
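If it helps, sub-sampling an assembly FASTA to its first N records only takes a few lines of plain Python. This is a minimal sketch, not part of Binny: the helper name `subsample_fasta`, the file paths, and the count of 50k are all illustrative.

```python
# Sketch: copy only the first n_seqs FASTA records from src to dst,
# so a trial Binny run works on a much smaller assembly.
def subsample_fasta(src, dst, n_seqs=50_000):
    """Return the number of records written (at most n_seqs)."""
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                kept += 1
                if kept > n_seqs:
                    break  # stop at the start of record n_seqs + 1
            fout.write(line)
    return min(kept, n_seqs)

# Hypothetical usage:
# subsample_fasta("ASSEMBLY/final_assembly.fasta", "subsampled.fasta")
```

Tools like seqkit can do the same thing, but the sketch above avoids an extra dependency.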

@kingtom2016
Author

Actually, I have tested Binny on other, relatively small datasets, and the system works well.
When running on the large dataset (each sample has roughly 1 Gbp of assembly sequence >500 bp, generated by MEGAHIT), the system also looks normal before the Binny step (the top command shows a Python process using CPU and RAM). I guess the large assembly may be what causes this RAM and speed problem?

@ohickl
Contributor

ohickl commented Jan 30, 2023

Could be. I am still puzzled as to why it would stall at the k-mer counting without throwing any error. I will run some tests to see if I can reproduce it. In the meantime, you could try limiting the assembly to e.g. 1*10^6 or 5*10^5 sequences, if it is much above that, to see if that helps.
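One way to cap an assembly at e.g. 5*10^5 sequences while keeping the most informative contigs is to retain only the N longest records. A minimal sketch under that assumption (the helper name `longest_contigs` and the file paths are illustrative; note that it loads the whole assembly into memory, so it trades RAM for simplicity):

```python
# Sketch: write the n_seqs longest FASTA records from src to dst.
def longest_contigs(src, dst, n_seqs=500_000):
    """Return the number of records written (at most n_seqs)."""
    records = []
    header, seq = None, []
    with open(src) as fin:
        for line in fin:
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line, []
            else:
                seq.append(line.strip())
        if header is not None:
            records.append((header, "".join(seq)))
    # Longest contigs first, then keep the top n_seqs.
    records.sort(key=lambda rec: len(rec[1]), reverse=True)
    kept = records[:n_seqs]
    with open(dst, "w") as fout:
        for head, sequence in kept:
            fout.write(head)        # header line already ends in "\n"
            fout.write(sequence + "\n")
    return len(kept)
```

Sorting by length (rather than taking the first N records) keeps the contigs that binning benefits from most, since very short contigs carry the least signal.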

@kingtom2016
Author

I noticed this warning:
"/mnt/g/stone_meta/software/binny/conda/631d6f5983d746bd3b67fe54d30e5f94/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py:702: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak."
Is this a clue?
