Setting %env OMP_NUM_THREADS=1 speeds up the routines in example 11 tremendously #87

Open
uellue opened this issue Oct 16, 2023 · 4 comments

Comments

@uellue
Contributor

uellue commented Oct 16, 2023

When matching a larger dataset following example 11, but using lazy arrays, running this as the very first cell, before any numerical library is loaded, speeds up the calculation on a machine with 24 cores:

%env OMP_NUM_THREADS=1

When @sk1p profiled the system under load without this setting, it spent most of its time in sched_yield instead of doing useful work. With this setting enabled (no OpenMP multithreading) it was mostly doing useful work. I didn't benchmark the difference precisely because I ran out of patience, but it is roughly a factor of 10.

Some routines in SciPy and NumPy are multi-threaded internally, for example via OpenBLAS. It seems that Dask's/pyxem's parallelism in combination with OpenMP/OpenBLAS threading leads to oversubscription of the CPU or some other kind of scheduling issue. Restricting to only one level of parallelism resolves this.
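For reference, a minimal sketch of how the same limits can be applied in a plain Python script rather than via `%env`. The variable names below are the ones commonly honoured by OpenMP, OpenBLAS and MKL; which one actually takes effect depends on how NumPy/SciPy were built, so treat this as an illustration rather than a definitive recipe:

```python
import os

# Must run before NumPy/SciPy (and anything that imports them) is loaded,
# because the BLAS/OpenMP thread pools are sized at import time.
os.environ.setdefault("OMP_NUM_THREADS", "1")        # OpenMP (and OpenBLAS built with OpenMP)
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")   # OpenBLAS' own threading
os.environ.setdefault("MKL_NUM_THREADS", "1")        # Intel MKL, if that is the BLAS in use

import numpy as np  # noqa: E402  -- imported only after the limits are set
```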

FYI, we encountered a similar issue in LiberTEM. In order to avoid setting the environment variable and disabling threading altogether, we implemented a few context managers that set the thread count to 1 in code blocks that run in parallel: https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/common/threading.py
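As an illustration of the idea (not the actual LiberTEM implementation linked above), a context manager along these lines can be built on top of threadpoolctl, which can throttle OpenBLAS/MKL/OpenMP thread pools at runtime:

```python
from contextlib import contextmanager

from threadpoolctl import threadpool_limits


@contextmanager
def single_threaded_blas():
    """Temporarily force BLAS/OpenMP thread pools to a single thread.

    Intended to wrap code that already runs in parallel under Dask, so that
    the two levels of parallelism don't oversubscribe the CPU.
    """
    with threadpool_limits(limits=1):
        yield


# Usage inside a function that Dask maps over chunks:
# with single_threaded_blas():
#     result = np.linalg.svd(block)
```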

Maybe that can be useful in HyperSpy/pyxem? Perhaps this should actually be handled in Dask itself.

@CSSFrancis
Member

@uellue This is really useful information and a great help!

I've suspected something like this was happening but never got around to tracking down why. I would imagine that Dask would be very interested in this as well. Does the problem also occur with dask-distributed? I usually get fairly good performance with 2-4 threads per process using the distributed backend, but the scheduling seems quite a bit slower than I feel it should be.
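For context, the "2-4 threads per process" layout mentioned here corresponds to a worker configuration like the following (a minimal sketch, not from the thread; the exact worker/thread split is whatever you normally pass to LocalCluster):

```python
from dask.distributed import Client, LocalCluster

# Example layout: 6 worker processes with 2 threads each on a 12-core machine.
# Combined with internal OpenBLAS/OpenMP threading this can still oversubscribe
# the CPU, which is why pinning the BLAS pools to one thread per task helps.
cluster = LocalCluster(n_workers=6, threads_per_worker=2)
client = Client(cluster)
```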

@uellue
Contributor Author

uellue commented Oct 16, 2023

Yes, we had the same issue with dask-distributed. It is not so apparent on small machines, but a big machine slows to a crawl. I'm not sure if it also happens with native Dask array operations; that remains to be tested. I'll open an issue in Dask for discussion.
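One rough way to test this (a sketch, not something from the thread) would be to time a BLAS-heavy Dask array operation on the threaded scheduler with and without the BLAS pools limited:

```python
import time

import dask.array as da
from threadpoolctl import threadpool_limits


def timed_matmul():
    # Chunked matrix product: each chunk multiplication calls into BLAS.
    x = da.random.random((4000, 4000), chunks=(1000, 1000))
    t0 = time.perf_counter()
    (x @ x).sum().compute(scheduler="threads")
    return time.perf_counter() - t0


# Default: Dask threads * OpenBLAS threads may oversubscribe the CPU.
print("unrestricted BLAS:", timed_matmul())

# BLAS limited to one thread per call: only Dask-level parallelism remains.
with threadpool_limits(limits=1):
    print("single-threaded BLAS:", timed_matmul())
```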

@uellue
Contributor Author

uellue commented Oct 16, 2023

@CSSFrancis
Member

@uellue Sounds like some better context managers are in order for hyperspy/pyxem. Thanks for bringing this up!

By the way, I am planning on making a couple of changes to the orientation mapping code in the next week or two, mostly to simplify the method and let it use dask-distributed so it can use multiple GPUs. Are there any changes you might be interested in seeing?
