Clarify the effect of --ncpus #86

Open
quantumdot opened this issue Dec 6, 2022 · 1 comment

Comments

@quantumdot

Can you clarify the exact effect of the CLI parameter --ncpus? Also, what are the recommended settings when running on a compute cluster? The reason I ask is that my university cluster system administrators keep complaining to me that my modelling jobs are over-subscribing CPU resources.

From what I can tell, the --ncpus parameter is simply passed into model.resample_model() within moseq2_model.train.util.train_model(), and within the model class (for instance ARWeakLimitStickyHDPHMM), this sets up a joblib.Parallel context with n_jobs=ncpus and the multiprocessing backend.
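
Roughly, I believe the call chain boils down to something like the following (a minimal sketch with made-up names such as `resample_states` and `states_list`, not the actual moseq2_model source):

```python
from joblib import Parallel, delayed

def resample_states(states_list, ncpus):
    # One worker process per state sequence, up to `ncpus` processes.
    # Each worker may additionally spawn OpenMP/native threads inside the
    # autoregressive extension, which is where oversubscription can begin.
    return Parallel(n_jobs=ncpus, backend="multiprocessing")(
        delayed(state.resample)() for state in states_list
    )
```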

But I recently found that the autoregressive library automatically parallelizes computation via OpenMP and native threads (releasing the Python GIL). So I suspect there are situations where different libraries simultaneously attempt to parallelize the work, thus oversubscribing the CPU cores.

If I ask Slurm for 8 cores and pass --ncpus=8, the job is oversubscribed (8 moseq2-model processes, each at roughly 90-150% CPU in top).

If I ask Slurm for 8 cores and do not pass --ncpus at all, the job is not oversubscribed but still uses most of the 8 cores (a single moseq2-model process at roughly 760% CPU in top).
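
As a workaround I have been capping the native thread pools so that only the joblib layer parallelizes. This is my own assumption rather than anything from the docs, and I have not confirmed it covers the OpenMP code in the autoregressive extension; exporting the same variables in the sbatch script before invoking moseq2-model should behave the same, since child processes inherit them.

```python
# Workaround sketch (assumption, not an official recommendation): pin the
# native/OpenMP thread pools to one thread per process before any numerical
# imports, so the only remaining parallelism is the --ncpus joblib processes.
import os

os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP threads per process
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS threads per process
os.environ["MKL_NUM_THREADS"] = "1"       # MKL threads per process

import numpy as np  # noqa: E402 - numerical imports only after the caps are set
```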

Many of the "batching" command generators (for example, generating jobs for a kappa scan) incorporate ncpus into both the Slurm preamble and the moseq2-model command (see the sketch below).
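
For reference, that coupling looks roughly like this (illustration only, not the real generator code, and the exact subcommand and sbatch flags here are from memory):

```python
# Illustration only (not the actual batch generator): the same ncpus value
# is written both into the Slurm preamble and onto the model command line,
# which ties the Slurm allocation to the joblib process count.
def make_job_script(model_args, ncpus=8):
    preamble = (
        "#!/bin/bash\n"
        f"#SBATCH --cpus-per-task={ncpus}\n"
    )
    command = f"moseq2-model learn-model {model_args} --ncpus {ncpus}\n"
    return preamble + command
```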

@versey-sherry
Contributor

The --ncpus flag is meant to set up joblib.Parallel for multiprocessing. I checked the code and there is indeed additional parallelized computation via OpenMP, so it looks like the issue comes from having two layers of parallelism running at the same time. @wingillis @calebweinreb is this something we are interested in fixing?
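
One possible direction (just a sketch; I have not checked whether threadpoolctl can see the OpenMP runtime the autoregressive extension uses, and the function names here are placeholders) would be to clamp the native thread pools inside each joblib worker so total CPU use stays close to ncpus:

```python
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits

def _resample_one(state):
    # Keep OpenMP/BLAS pools at a single thread inside this worker process.
    with threadpool_limits(limits=1):
        return state.resample()

def resample_states(states_list, ncpus):
    # ncpus joblib processes, each restricted to one native thread.
    return Parallel(n_jobs=ncpus, backend="multiprocessing")(
        delayed(_resample_one)(s) for s in states_list
    )
```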
