
Investigate possible speedups through Numba #364

Open
matthewfeickert opened this issue Nov 6, 2018 · 14 comments
Labels: follow up, research experimental stuff, SciPy Conference (Resulted from discussions at a SciPy conference)

matthewfeickert (Member) commented Nov 6, 2018

Description

As @lukasheinrich has pointed out in team discussions, it would be interesting to see if it is possible to get a speedup from Numba; judging from the Numba "5 minute guide", the answer seems to be "probably yes". If pyhf can easily work with Numba, then tests should be done to study the speedup.

For some references, there was a DIANA/HEP meeting on Numba.

@matthewfeickert matthewfeickert added the research experimental stuff label Nov 6, 2018
@matthewfeickert matthewfeickert changed the title Investigate possible speedups through numba Investigate possible speedups through Numba Nov 6, 2018
matthewfeickert (Member Author) commented:

This is a self-ping to try to follow up on this in the coming months.

matthewfeickert (Member Author) commented:

Looking at a few things in the NumPy backend with @seibert, it was seen that since most of the backend consists of thin wrappers around direct NumPy calls, we are unlikely to get a speedup there. However, for things like poisson_logpdf

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L181-L184

and poisson

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L208-L210

there could be a speedup, but SciPy functions are not yet supported by Numba (they might be in the future). For the moment there is https://github.com/person142/numba_special, but it might be worth just waiting 6 months and seeing if Numba gets time to work on this.
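
For concreteness, here is a minimal sketch of the pattern in question; the body matches the gammaln-based log-pdf discussed in this thread, but the wrapper details are illustrative rather than the exact backend code:

import numpy as np
from scipy.special import gammaln
import numba

def poisson_logpdf(n, lam):
    # thin wrapper: everything here is a NumPy/SciPy call
    n = np.asarray(n)
    lam = np.asarray(lam)
    return n * np.log(lam) - lam - gammaln(n + 1.0)

# A naive numba.njit(poisson_logpdf) fails in nopython mode because Numba
# has no typing for the SciPy ufunc gammaln (see the traceback further down).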

lukasheinrich (Contributor) commented:

yes, this is what I suspected. One thing that might be worth investigating is whether the slow interpolation codes (which are pure Python) could actually become faster than the "fast tensorized" versions once passed through Numba. That is, there are a few nested loops that, from a discussion with @jpivarski, seem to be good candidates; a sketch follows below.
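
For illustration, a hedged sketch of what a Numba-compiled nested-loop (HistFactory-style piecewise-linear) interpolation could look like; the function name and array layout are invented for the example and are not pyhf's actual interpolation code:

import numpy as np
import numba

@numba.njit
def interpolate_linear(nominal, down, up, alphas):
    # nominal, down, up: (n_histos, n_bins); alphas: (n_alphas,)
    result = np.empty((alphas.shape[0], nominal.shape[0], nominal.shape[1]))
    for i in range(alphas.shape[0]):
        alpha = alphas[i]
        for j in range(nominal.shape[0]):
            for k in range(nominal.shape[1]):
                # piecewise linear: interpolate towards "up" for alpha >= 0,
                # towards "down" for alpha < 0
                if alpha >= 0:
                    delta = up[j, k] - nominal[j, k]
                else:
                    delta = nominal[j, k] - down[j, k]
                result[i, j, k] = nominal[j, k] + alpha * delta
    return result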

If that is the case, it's not clear how to proceed. We have settled on the array-computation underpinnings for backend portability (switching from e.g. NumPy to TensorFlow or PyTorch), so even if we get a speedup from Numba, we'd need to think about how to integrate this into pyhf if it requires breaking the "NumPy"-ness/tensorlib character of the codebase.

matthewfeickert (Member Author) commented:

Yeah, I can look into this. The number of places where we can get a speedup from Numba is quite limited, as none of the functions in utils.py will benefit given that we need

tensorlib, optimizer = get_backend()

in all of them.

matthewfeickert (Member Author) commented Jul 13, 2019

Using this script I get

$ git checkout master 
$ python numba_timing_test.py --config examples/2-bin_1-channel.json 
Expected CL_s: [0.07807427911686152, 0.17472571775474582, 0.35998495263681274, 0.6343568235898907, 0.8809947004472013]
Observed CL_s: 0.3599845631401913

Total run time: 164.6846797466278 seconds
Mean time over 5000 tests: 0.03293693594932556 seconds
$ git checkout feature/add-Numba-support 
Switched to branch 'feature/add-Numba-support'
$ python numba_timing_test.py --config examples/2-bin_1-channel.json 
Expected CL_s: [0.07807427911686152, 0.17472571775474582, 0.35998495263681274, 0.6343568235898907, 0.8809947004472013]
Observed CL_s: 0.3599845631401913

Total run time: 166.62471294403076 seconds
Mean time over 5000 tests: 0.03332494258880615 seconds

Which isn't too surprising, given that at the moment I'm only able to @jit the normal_logpdf,

https://github.com/diana-hep/pyhf/blob/f7b692c8481e92a095a81e6203c3857ad27a9cbc/pyhf/tensor/numpy_backend.py#L212-L220

so the overhead here probably just isn't worth it given that I'm doing a small 2-bin example.
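
For reference, the kind of function that can be decorated directly is a pure-NumPy normal log-pdf, since every operation in it is something Numba already understands; this is the textbook formula, not necessarily the exact backend implementation:

import numpy as np
import numba

@numba.njit
def normal_logpdf(x, mu, sigma):
    # log N(x; mu, sigma) = -log(sigma * sqrt(2*pi)) - (x - mu)**2 / (2 * sigma**2)
    return -np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * np.square((x - mu) / sigma)

# e.g. normal_logpdf(np.array([0.5]), np.array([0.0]), np.array([1.0]))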

matthewfeickert (Member Author) commented Jul 13, 2019

As not even numba-special has gammaln support

numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'gammaln': cannot determine Numba type of <class 'numpy.ufunc'>

File "pyhf/tensor/numpy_backend.py", line 192:
    def poisson_logpdf(n, lam):
        <source elided>
        lam = np.asarray(lam)
        return n * np.log(lam) - lam - gammaln(n + 1.0)

I suggest that we just follow this Numba Issue and try again in about 6 months, or whenever the Issue gets resolved.

person142 commented:

Random comment from the peanut gallery, but numba_special does support gammaln:

In [1]: import numba

In [2]: import scipy.special as sc

In [3]: import numba_special

In [4]: import numpy as np

In [5]: @numba.njit
   ...: def poisson_logpdf(n, lam):
   ...:     # More stuff should go here...
   ...:     return n * np.log(lam) - lam - sc.gammaln(n + 1.0)
   ...:

In [6]: poisson_logpdf(1.0, 1.0)
Out[6]: -1.0

Likely what you are seeing is that it specifically only supports gammaln with a float64 argument (no arrays, and it also doesn't do any upcasting from integer types; those are conveniences that ufuncs usually give you).

It would be good to add those things (and it's not hard to do), but I haven't had time to do it yet. Feel free to open an issue. (When scumba comes into existence the numba_special code will likely be copied into it wholesale, so any improvements should propagate.)
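
In the meantime, a hedged workaround sketch: wrap the scalar-only kernel in numba.vectorize to recover array support and broadcasting (this assumes numba_special's float64 gammaln overload also works inside vectorize the way it does inside njit):

import numba
import numba_special  # registers scipy.special.gammaln for float64 scalars
import numpy as np
import scipy.special as sc

@numba.vectorize([numba.float64(numba.float64, numba.float64)])
def poisson_logpdf(n, lam):
    return n * np.log(lam) - lam - sc.gammaln(n + 1.0)

# Arrays and broadcasting now work as for any NumPy ufunc:
# poisson_logpdf(np.arange(1.0, 5.0), 2.0)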

jpivarski (Member) commented:

A long time ago, I implemented gammaln and some related functions by hand so that I could vectorize them with Numpy. You don't want to vectorize these because you're using Numba, but unvectorizing is easier than vectorizing.

Here's a link: https://github.com/opendatagroup/augustus/blob/master/trunk/augustus-pmml-library/augustus/core/NumpyInterface.py#L128-L154

It's unfortunately a little obscured by using NP everywhere instead of Numpy itself because we wanted to count the number of times Numpy was called.

The thing you have to do to make Numba efficient is to stay in the compiled code block through the big calculation. In your case, I don't know if it's more compute-limited or data throughput-limited. (I would say compute-limited because it's toy MCs?) But either way, you might have to implement your own gammaln to stay in the compiled code.

Or use C code linked through ctypes or cffi. Numba recognizes these and links in the compiled version without stepping out to Python. But if it's linking into a .so object through ctypes or cffi, it can't inline, as it does with @jit functions calling @jit functions.
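
A hedged sketch of that ctypes route, binding the C math library's lgamma and calling it from nopython-compiled code; note that find_library("m") is a POSIX assumption and the library name differs elsewhere:

import ctypes
import ctypes.util
import math

import numba

# bind lgamma from the C math library and declare its signature so Numba can type it
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.lgamma.restype = ctypes.c_double
libm.lgamma.argtypes = (ctypes.c_double,)
c_lgamma = libm.lgamma

@numba.njit
def poisson_logpdf(n, lam):
    # the ctypes function pointer is called without dropping back to Python,
    # but it cannot be inlined the way a @jit-to-@jit call can
    return n * math.log(lam) - lam - c_lgamma(n + 1.0)

# poisson_logpdf(2.0, 3.0)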

lukasheinrich (Contributor) commented Jul 13, 2019

Thanks @jpivarski -- do you have any guidance on what to do with respect to vectorized code? A lot of performance-critical code in the Python ecosystem uses NumPy to get C-level performance, and now it seems Numba might be able to speed things up beyond what you could achieve with NumPy. De-vectorizing all that existing code seems infeasible. Is there some way that this could be automated?

jpivarski (Member) commented:

If you already have a vectorized codebase, then you may be limited by flushing CPU caches where NumPy can't fuse operations. For example, in

y = g(f(x))

where x is a big array and f and g are vectorized operations, the first step fills the CPU cache as it goes through all the elements of x, the last elements end up evicting the first, and then when g starts back at the first element, nothing is in cache. That happens when x is much bigger than the most relevant cache (L1, L2, or L3); e.g. more than 10 MB would do it on most architectures.

I have heard (*rumor! I've never confirmed it!) that Numpy now fuses some operations like +. If so, that's great, and this applies to those operations that it doesn't fuse.

To manually fuse them, you can write f∘g as a ufunc:

@numba.vectorize
def fog(x_i):      # function argument takes one element of x
    y_i = ...
    return y_i     # and returns one element of y

and if you need fog to be automatically recognized as a ufunc (e.g. for __array_ufunc__), you should give it an explicit signature:

@numba.vectorize([numba.float64(numba.float64)])

Now the operation of f and g can be done in one pass over the data and is more CPU cache-friendly. Technically, this is de-vectorizing a little at a time.
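
As a concrete (made-up) example of such a fusion, squaring, negating, and exponentiating a large array in a single pass rather than via intermediate temporaries:

import numba
import numpy as np

@numba.vectorize([numba.float64(numba.float64)])
def fog(x_i):
    # one compiled pass instead of the temporaries created by np.exp(-np.square(x))
    return np.exp(-x_i * x_i)

# x = np.random.rand(10_000_000)
# y_fused = fog(x)                    # single pass over x
# y_unfused = np.exp(-np.square(x))   # multiple passes and temporary arrays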

If we think of the process as operating on a big table (rows are data points, columns are named, typed attributes), there are two ways to de-vectorize: (1) by chunking rows into row groups and (2) by fusing column-at-a-time operations into several-columns-at-a-time operations. Both of these are used to fit into memory constraints: moving data into some memory unit is expensive, so you want to finish everything you're going to do to it before letting it out of the memory unit to get new data in. Pure row-wise processing could either be thought of as "every row is its own row group" or "all operations are a single, fused operation." Columnar processing is relaxing these two constraints as much as memory will allow.

In GPU programming, row groups are called "blocks" (you set the block size of every kernel you run) and fusing is putting more steps in a single kernel, rather than defining separate kernels and running them sequentially. In GPU programming, limiting block size (1) is used more than fusing operations (2) because launching a new kernel from C++ is relatively inexpensive. In Python/Numpy, calling a compiled function from Python can be quite expensive, so fusing operations is more useful when you're allowed to do it -- when you can have dependencies beyond just Numpy.

(Also, in GPU programming, you have complete control over your "cache," the GPU's global memory, but in CPU programming, you don't.)

In all of the above, I've assumed that you're data throughput-limited, not compute-limited. Of course, you should profile your code and watch your memory use to find out where the bottlenecks are and see where you're limited by memory movement. (I wonder if you can view hardware counters from Python, to see if you rack up a lot of cache page misses at some step...)

kratsg (Contributor) commented Jul 15, 2019

In all of the above, I've assumed that you're data throughput-limited, not compute-limited. Of course, you should profile your code and watch your memory use to find out where the bottlenecks are and see where you're limited by memory movement. (I wonder if you can view hardware counters from Python, to see if you rack up a lot of cache page misses at some step...)

We ran cProfile at some point and noted that almost all of the time is spent inside NumPy functions, for example -- but it's not clear whether that includes memory movement or not...
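
For what it's worth, a minimal sketch of that kind of cProfile check (workload here is a stand-in, not pyhf code); cProfile attributes time to Python-level callers, so memory traffic inside a single NumPy call is indeed invisible to it:

import cProfile
import pstats

import numpy as np

def workload():
    # stand-in for the real fit: repeated vectorized NumPy work
    x = np.random.rand(5_000_000)
    for _ in range(20):
        x = np.exp(-np.square(x))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)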

matthewfeickert (Member Author) commented:

Scumba now exists under the Numba project.

@matthewfeickert matthewfeickert added the SciPy Conference Resulted from discussions at a SciPy conference label Jul 7, 2020
lukasheinrich (Contributor) commented:

some studies here: #1173

matthewfeickert (Member Author) commented:

For some belated follow-up on this, @HDembinski has recently made the numba-stats project, which looks like it might be interesting to learn from or take advantage of. As Hans notes:

I am aware of numba-scipy, but they don't plan to release fast statistics anytime soon or ever at all numba/numba-scipy#42
