
Investigate possible speedups through Numba #364

Open
matthewfeickert opened this issue Nov 6, 2018 · 14 comments
Labels: follow up, research experimental stuff, SciPy Conference (Resulted from discussions at a SciPy conference)

matthewfeickert (Member) commented Nov 6, 2018

Description

As @lukasheinrich has pointed out in team discussions, it would be interesting to see if it is possible to get a speedup from Numba; judging from the Numba "5 minute guide", the answer seems to be "probably yes". If pyhf can easily work with Numba, then tests should be done to study the speedup.

For some references, there was a DIANA/HEP meeting on Numba.

@matthewfeickert matthewfeickert added the research experimental stuff label Nov 6, 2018
@matthewfeickert matthewfeickert changed the title Investigate possible speedups through numba Investigate possible speedups through Numba Nov 6, 2018
matthewfeickert (Member Author) commented:

This is a self-ping to try to follow up on this in the coming months.

matthewfeickert (Member Author) commented:

Looking at a few things in the NumPy backend with @seibert, it was seen that since most of the backend consists of thin wrappers around direct NumPy calls, we are unlikely to get a speedup there. However, for things like poisson_logpdf

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L181-L184

and poisson

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L208-L210

there could be a speedup, but SciPy functions are not yet supported by Numba (they might be in the future). For the moment there is https://github.com/person142/numba_special, but it might be worth just waiting 6 months and seeing if Numba gets time to work on this.
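
For concreteness, here is a minimal sketch of the pattern in question; the body matches the gammaln-based log-pdf discussed in this thread, but the wrapper details are illustrative rather than the exact backend code:

import numpy as np
from scipy.special import gammaln
import numba

def poisson_logpdf(n, lam):
    # thin wrapper: everything here is a NumPy/SciPy call
    n = np.asarray(n)
    lam = np.asarray(lam)
    return n * np.log(lam) - lam - gammaln(n + 1.0)

# A naive numba.njit(poisson_logpdf) fails in nopython mode because Numba
# has no typing for the SciPy ufunc gammaln (see the traceback further down).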

lukasheinrich (Contributor) commented:

yes, this is what I suspected. One thing that might be worth investigating is whether the slow interpolation codes (which are pure Python) could actually become faster than the "fast tensorized" versions once passed through Numba. That is, there are a few nested loops that, from a discussion with @jpivarski, seem to be good candidates; a sketch follows below.
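
For illustration, a hedged sketch of what a Numba-compiled nested-loop (HistFactory-style piecewise-linear) interpolation could look like; the function name and array layout are invented for the example and are not pyhf's actual interpolation code:

import numpy as np
import numba

@numba.njit
def interpolate_linear(nominal, down, up, alphas):
    # nominal, down, up: (n_histos, n_bins); alphas: (n_alphas,)
    result = np.empty((alphas.shape[0], nominal.shape[0], nominal.shape[1]))
    for i in range(alphas.shape[0]):
        alpha = alphas[i]
        for j in range(nominal.shape[0]):
            for k in range(nominal.shape[1]):
                # piecewise linear: interpolate towards "up" for alpha >= 0,
                # towards "down" for alpha < 0
                if alpha >= 0:
                    delta = up[j, k] - nominal[j, k]
                else:
                    delta = nominal[j, k] - down[j, k]
                result[i, j, k] = nominal[j, k] + alpha * delta
    return result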

If that is the case, it's not clear how to proceed. We have settled on the array-computation underpinnings for backend portability (switching from e.g. NumPy to TensorFlow or PyTorch), so even if we get a speedup from Numba, we'd need to think about how to integrate this into pyhf if it requires breaking the "NumPy"-ness/tensorlib character of the codebase.

matthewfeickert (Member Author) commented:

Yeah, I can look into this. The number of places where we can get a speedup from Numba is quite limited, as none of the functions in utils.py will benefit given that we need

tensorlib, optimizer = get_backend()

in all of them.

matthewfeickert (Member Author) commented Jul 13, 2019

Using this script I get

$ git checkout master 
$ python numba_timing_test.py --config examples/2-bin_1-channel.json 
Expected CL_s: [0.07807427911686152, 0.17472571775474582, 0.35998495263681274, 0.6343568235898907, 0.8809947004472013]
Observed CL_s: 0.3599845631401913

Total run time: 164.6846797466278 seconds
Mean time over 5000 tests: 0.03293693594932556 seconds
$ git checkout feature/add-Numba-support 
Switched to branch 'feature/add-Numba-support'
$ python numba_timing_test.py --config examples/2-bin_1-channel.json 
Expected CL_s: [0.07807427911686152, 0.17472571775474582, 0.35998495263681274, 0.6343568235898907, 0.8809947004472013]
Observed CL_s: 0.3599845631401913

Total run time: 166.62471294403076 seconds
Mean time over 5000 tests: 0.03332494258880615 seconds

Which isn't too surprising, given that at the moment I'm only able to @jit the normal_logpdf,

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L212-L220

so the overhead here probably just isn't worth it given that I'm doing a small 2-bin example.
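
For reference, the kind of function that can be decorated directly is a pure-NumPy normal log-pdf, since every operation in it is something Numba already understands; this is the textbook formula, not necessarily the exact backend implementation:

import numpy as np
import numba

@numba.njit
def normal_logpdf(x, mu, sigma):
    # log N(x; mu, sigma) = -log(sigma * sqrt(2*pi)) - (x - mu)**2 / (2 * sigma**2)
    return -np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * np.square((x - mu) / sigma)

# e.g. normal_logpdf(np.array([0.5]), np.array([0.0]), np.array([1.0]))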

matthewfeickert (Member Author) commented Jul 13, 2019

As not even numba-special has gammaln support

numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'gammaln': cannot determine Numba type of <class 'numpy.ufunc'>

File "pyhf/tensor/numpy_backend.py", line 192:
    def poisson_logpdf(n, lam):
        <source elided>
        lam = np.asarray(lam)
        return n * np.log(lam) - lam - gammaln(n + 1.0)

I suggest that we just follow this Numba Issue and try again in about 6 months, or whenever the Issue gets resolved.

person142 commented:

Random comment from the peanut gallery, but numba_special does support gammaln:

In [1]: import numba

In [2]: import scipy.special as sc

In [3]: import numba_special

In [4]: import numpy as np

In [5]: @numba.njit
   ...: def poisson_logpdf(n, lam):
   ...:     # More stuff should go here...
   ...:     return n * np.log(lam) - lam - sc.gammaln(n + 1.0)
   ...:

In [6]: poisson_logpdf(1.0, 1.0)
Out[6]: -1.0

Likely what you are seeing is that it specifically only supports gammaln with a float64 argument (no arrays, and it also doesn't do any upcasting from integer types; those are conveniences that ufuncs usually give you).

It would be good to add those things (and it's not hard to do), but I haven't had time to do it yet. Feel free to open an issue. (When scumba comes into existence the numba_special code will likely be copied into it wholesale, so any improvements should propagate.)
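
In the meantime, a hedged workaround sketch: wrap the scalar-only kernel in numba.vectorize to recover array support and broadcasting (this assumes numba_special's float64 gammaln overload also works inside vectorize the way it does inside njit):

import numba
import numba_special  # registers scipy.special.gammaln for float64 scalars
import numpy as np
import scipy.special as sc

@numba.vectorize([numba.float64(numba.float64, numba.float64)])
def poisson_logpdf(n, lam):
    return n * np.log(lam) - lam - sc.gammaln(n + 1.0)

# Arrays and broadcasting now work as for any NumPy ufunc:
# poisson_logpdf(np.arange(1.0, 5.0), 2.0)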

jpivarski (Member) commented:

A long time ago, I implemented gammaln and some related functions by hand so that I could vectorize them with Numpy. You don't want to vectorize these because you're using Numba, but unvectorizing is easier than vectorizing.

Here's a link: https://github.com/opendatagroup/augustus/blob/master/trunk/augustus-pmml-library/augustus/core/NumpyInterface.py#L128-L154

It's unfortunately a little obscured by using NP everywhere instead of Numpy itself because we wanted to count the number of times Numpy was called.

The thing you have to do to make Numba efficient is to stay in the compiled code block through the big calculation. In your case, I don't know if it's more compute-limited or data throughput-limited. (I would say compute-limited because it's toy MCs?) But either way, you might have to implement your own gammaln to stay in the compiled code.

Or use C code linked through ctypes or cffi. Numba recognizes these and links in the compiled version without stepping out to Python. But if it's linking into a .so object through ctypes or cffi, it can't inline, as it does with @jit functions calling @jit functions.
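
A hedged sketch of that ctypes route, binding the C math library's lgamma and calling it from nopython-compiled code; note that find_library("m") is a POSIX assumption and the library name differs elsewhere:

import ctypes
import ctypes.util
import math

import numba

# bind lgamma from the C math library and declare its signature so Numba can type it
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.lgamma.restype = ctypes.c_double
libm.lgamma.argtypes = (ctypes.c_double,)
c_lgamma = libm.lgamma

@numba.njit
def poisson_logpdf(n, lam):
    # the ctypes function pointer is called without dropping back to Python,
    # but it cannot be inlined the way a @jit-to-@jit call can
    return n * math.log(lam) - lam - c_lgamma(n + 1.0)

# poisson_logpdf(2.0, 3.0)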

lukasheinrich (Contributor) commented Jul 13, 2019

Thanks @jpivarski -- do you have any guidance on what to do with respect to vectorized code? A lot of performance-critical code in the Python ecosystem uses NumPy to get C-level performance, and now it seems Numba might be able to speed things up beyond what you could achieve with NumPy. De-vectorizing all that existing code seems infeasible. Is there some way that this could be automated?

jpivarski (Member) commented:

If you already have a vectorized codebase, then you may be limited by flushing CPU caches where NumPy can't fuse operations. For example, in

y = g(f(x))

where x is a big array and f and g are vectorized operations, the first step fills the CPU cache as it goes through all the elements of x, the last elements end up evicting the first, and then when g starts back at the first element, nothing is in cache. That happens when x is much bigger than the most relevant cache (L1, L2, or L3); e.g. more than 10 MB would do it on most architectures.

I have heard (*rumor! I've never confirmed it!) that Numpy now fuses some operations like +. If so, that's great, and this applies to those operations that it doesn't fuse.

To manually fuse them, you can write f∘g as a ufunc:

@numba.vectorize
def fog(x_i):      # function argument takes one element of x
    y_i = ...
    return y_i     # and returns one element of y

and if you need fog to be automatically recognized as a ufunc (e.g. for __array_ufunc__), you should give it an explicit signature:

@numba.vectorize([numba.float64(numba.float64)])

Now the operation of f and g can be done in one pass over the data and is more CPU cache-friendly. Technically, this is de-vectorizing a little at a time.
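
As a concrete (made-up) example of such a fusion, squaring, negating, and exponentiating a large array in a single pass rather than via intermediate temporaries:

import numba
import numpy as np

@numba.vectorize([numba.float64(numba.float64)])
def fog(x_i):
    # one compiled pass instead of the temporaries created by np.exp(-np.square(x))
    return np.exp(-x_i * x_i)

# x = np.random.rand(10_000_000)
# y_fused = fog(x)                    # single pass over x
# y_unfused = np.exp(-np.square(x))   # multiple passes and temporary arrays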

If we think of the process as operating on a big table (rows are data points, columns are named, typed attributes), there are two ways to de-vectorize: (1) by chunking rows into row groups and (2) by fusing column-at-a-time operations into several-columns-at-a-time operations. Both of these are used to fit into memory constraints: moving data into some memory unit is expensive, so you want to finish everything you're going to do to it before letting it out of the memory unit to get new data in. Pure row-wise processing could either be thought of as "every row is its own row group" or "all operations are a single, fused operation." Columnar processing is relaxing these two constraints as much as memory will allow.

In GPU programming, row groups are called "blocks" (you set the block size of every kernel you run) and fusing is putting more steps in a single kernel, rather than defining separate kernels and running them sequentially. In GPU programming, limiting block size (1) is used more than fusing operations (2) because launching a new kernel from C++ is relatively inexpensive. In Python/Numpy, calling a compiled function from Python can be quite expensive, so fusing operations is more useful when you're allowed to do it -- when you can have dependencies beyond just Numpy.

(Also, in GPU programming, you have complete control over your "cache," the GPU's global memory, but in CPU programming, you don't.)

In all of the above, I've assumed that you're data throughput-limited, not compute-limited. Of course, you should profile your code and watch your memory use to find out where the bottlenecks are and see where you're limited by memory movement. (I wonder if you can view hardware counters from Python, to see if you rack up a lot of cache page misses at some step...)

kratsg (Contributor) commented Jul 15, 2019

In all of the above, I've assumed that you're data throughput-limited, not compute-limited. Of course, you should profile your code and watch your memory use to find out where the bottlenecks are and see where you're limited by memory movement. (I wonder if you can view hardware counters from Python, to see if you rack up a lot of cache page misses at some step...)

We ran cProfile at some point and noted that almost all of the time is spent inside NumPy functions, for example -- but it's not clear whether that includes memory movement or not...
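
For what it's worth, a minimal sketch of that kind of cProfile check (workload here is a stand-in, not pyhf code); cProfile attributes time to Python-level callers, so memory traffic inside a single NumPy call is indeed invisible to it:

import cProfile
import pstats

import numpy as np

def workload():
    # stand-in for the real fit: repeated vectorized NumPy work
    x = np.random.rand(5_000_000)
    for _ in range(20):
        x = np.exp(-np.square(x))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)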

matthewfeickert (Member Author) commented:

Scumba now exists under the Numba project.

@matthewfeickert matthewfeickert added the SciPy Conference Resulted from discussions at a SciPy conference label Jul 7, 2020
lukasheinrich (Contributor) commented:

some studies here: #1173

matthewfeickert (Member Author) commented:

For some belated follow-up on this, @HDembinski has recently made the numba-stats project, which looks like it might be interesting to learn from or take advantage of. As Hans notes:

I am aware of numba-scipy, but they don't plan to release fast statistics anytime soon or ever at all numba/numba-scipy#42
