You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Can you clarify the exact effect of the CLI parameter --ncpus? Also, what are the recommended settings when running on a compute cluster? The reason I ask is that my university cluster system administrators keep complaining to me that my modelling jobs are over-subscribing CPU resources.
From what I can tell, the --ncpus parameter was simply passed into the model.resample_model() within moseq2_model.train.util.train_model(), and within the model class (for instance ARWeakLimitStickyHDPHMM, this initiates a joblib.Parallel context with n_jobs=ncpus and the multiprocessing backend.
But I recently found within the autoregressive library, that this code automatically parallelizes computation via openMP and native threads (releasing the python GIL). So I suspect there is a situation where different libraries are simultaneously attempting to parallelize work, and thus oversubscribing the CPU cores.
If I ask slurm for 8 cores and pass --ncpus=8, the job is oversubscribed (8 moseq2-model processes with ~90-150 load/%CPU in top).
if I ask slurm for 8 cores and do not pass --ncpus at all, the job is not oversubscribed but utilized most of the 8 cores (one single moseq2-model process with ~760 load/%CPU in top).
Many of the "batching" command generators (for example generating jobs for kappa scan) incorporate ncpus into both the slurm preamble as well as the moseq2-model command.
The text was updated successfully, but these errors were encountered:
The --ncpus flag meant to start joblib.Parallel for multiple processing. I checked the code and indeed there is additional parallelized computation via openMP so it looks like the issue comes from simultaneously having parallel computing set up. @wingillis@calebweinreb is this something we are interested in fixing?
Can you clarify the exact effect of the CLI parameter
--ncpus
? Also, what are the recommended settings when running on a compute cluster? The reason I ask is that my university cluster system administrators keep complaining to me that my modelling jobs are over-subscribing CPU resources.From what I can tell, the
--ncpus
parameter was simply passed into themodel.resample_model()
withinmoseq2_model.train.util.train_model()
, and within the model class (for instanceARWeakLimitStickyHDPHMM
, this initiates ajoblib.Parallel
context withn_jobs=ncpus
and the multiprocessing backend.But I recently found within the
autoregressive
library, that this code automatically parallelizes computation via openMP and native threads (releasing the python GIL). So I suspect there is a situation where different libraries are simultaneously attempting to parallelize work, and thus oversubscribing the CPU cores.If I ask slurm for 8 cores and pass
--ncpus=8
, the job is oversubscribed (8 moseq2-model processes with ~90-150 load/%CPU in top).if I ask slurm for 8 cores and do not pass
--ncpus
at all, the job is not oversubscribed but utilized most of the 8 cores (one single moseq2-model process with ~760 load/%CPU in top).Many of the "batching" command generators (for example generating jobs for kappa scan) incorporate
ncpus
into both the slurm preamble as well as themoseq2-model
command.The text was updated successfully, but these errors were encountered: