Need for dealing with strides when writing performant code against multiple array libraries #782
Replies: 8 comments
-
Having a look at the usage frequency in scikit-learn:

```
$ rg strides --type py
sklearn/datasets/tests/test_samples_generator.py
135: signs = signs.view(dtype="|S{0}".format(signs.strides[0]))
sklearn/feature_extraction/image.py
284: """Extracts patches of any n-dimensional array in place using strides.
326: patch_strides = arr.strides
329: indexing_strides = arr[slices].strides
336: strides = tuple(list(indexing_strides) + list(patch_strides))
338: patches = as_strided(arr, shape=shape, strides=strides)

$ rg as_strided --type py
sklearn/feature_extraction/image.py
16:from numpy.lib.stride_tricks import as_strided
338: patches = as_strided(arr, shape=shape, strides=strides)
```

And in SciPy:

```
$ rg as_strided --type py
signal/_spectral_py.py
1958: result = np.lib.stride_tricks.as_strided(x, shape=shape,
linalg/_special_matrices.py
5:from numpy.lib.stride_tricks import as_strided
218: return as_strided(vals[len(c)-1:], shape=out_shp, strides=(-n, n)).copy()
259: return as_strided(c_ext[L-1:], shape=(L, L), strides=(-n, n)).copy()
316: return as_strided(vals, shape=out_shp, strides=(n, n)).copy()
signal/_signaltools.py
2124: zi = np.lib.stride_tricks.as_strided(zi, expected_shape,
signal/tests/test_signaltools.py
445: a = np.lib.stride_tricks.as_strided(a, shape=(n, 1000), strides=(8008, 8))
linalg/special_matrices.py
12: 'fiedler', 'fiedler_companion', 'convolution_matrix', 'as_strided'
interpolate/tests/test_fitpack.py
347: from numpy.lib.stride_tricks import as_strided
354: x = as_strided(np.zeros(()), shape=(size,))
```

There are some more usages of
-
Interestingly enough, the second of the ... It may be insightful to take the ...
-
Strides also come up a lot in broadcast implementations:

```
>>> a = np.arange(2*3*5).reshape(2, 3, 5)
>>> b = np.array(100)
>>> aprime, bprime = np.broadcast_arrays(a, b)
>>> bprime.shape
(2, 3, 5)
>>> bprime.strides
(0, 0, 0)
```

I'm not sure if this is relevant (because it's an implementation detail), but in case it is, this is another big way that non-trivial strides enter into array libraries without JIT compilation/operator fusion.
-
I don't think so; broadcasting is a separate topic from strides. Only the array shape is relevant to determine broadcasting. Strided array libraries do have certain behavior regarding the strides of broadcast arrays, as you pointed out. But all array libraries, strided or not, can support broadcasting, and thus it's already included in the standard.

FYI, this issue was discussed in the array API meeting this week 🙂 Someone will write up a summary later, so I wouldn't dare attempt a poor summary in my own words. But I did just come up with another reason against standardizing strides: strides are meaningful only for dense arrays/tensors. Suppose I want to write a multidimensional array library that is array-API compliant and has dual dense/sparse support under the hood, without users making any decision about whether an array should be stored in the dense format or any of the known sparse formats (COO, CSR, ..., it gets complicated for N-D tensors). Then clearly I simply cannot return strides, because they are an implementation detail of my library that most users should not need to know about or rely on.
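As a small illustration of the point that broadcasting needs only shapes, never strides, here is a sketch of computing the broadcast result shape from input shapes alone. This is a hypothetical helper following the standard's broadcasting rules, not any library's actual implementation:

```python
def broadcast_shapes(*shapes):
    """Compute the broadcast result shape from input shapes alone."""
    ndim = max(len(s) for s in shapes)
    # Left-pad every shape with 1s to the common rank.
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    result = []
    for dims in zip(*padded):
        # Per axis: all sizes must agree, except size-1 axes which stretch.
        non_one = {d for d in dims if d != 1}
        if len(non_one) > 1:
            raise ValueError(f"shapes are not broadcast-compatible: {shapes}")
        result.append(non_one.pop() if non_one else 1)
    return tuple(result)
```

For the example above, `broadcast_shapes((2, 3, 5), ())` gives `(2, 3, 5)`; whether the library realizes that result with zero strides, tiling, or a lazy graph is up to the implementation.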
-
I agree with what @leofang says. Everything up to ... and it's very rarely needed to explicitly access or manipulate strides.
-
I suspect all of those uses of strides can be written using ... In either case, the point is that ... Memory order can be useful to ensure or hint at (it even makes some sense for sparse, as CSC and CSR are basically the same thing as F vs. C order).
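As a concrete illustration of rewriting the windowing-style uses without manual stride manipulation, here is a small sketch using NumPy's `sliding_window_view` (available since NumPy 1.20). The hop size and window length are made-up example values:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10)
# Overlapping length-4 segments with hop 2, with no stride arithmetic:
# take all sliding windows, then subsample along the window axis.
segments = sliding_window_view(x, 4)[::2]
```

This returns the same zero-copy view that a hand-written `as_strided` call would, but the intent is readable and the bounds are checked by the library.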
-
I like this idea: collecting the use-cases that strides have been used for, and defining APIs for those instead of for the strides themselves. (With 18 years of NumPy and even more for the predecessors, we must have seen all the use-cases by now.) This originally came up in the context of the paper for SciPy; indicating this as a possible direction would satisfy the needs of the paper. (Maybe after more of the group signs off on it, or maybe you're already in agreement on this and I'm just slow to catch on.)
-
Not a summary, but one point I made there, and should probably mention here too, is that for the SciPy ...
-
Early on when putting the array API standard together, we made a decision that strides must not be exposed in the Python API, because multiple array libraries don't support them, and memory layout is more an implementation detail than an API that should be exposed to Python users. We made that decision so early that I believe there isn't even a dedicated issue for it; the closest one is gh-571 about C vs. Fortran memory order. It's also related to gh-24 (strides are mentioned explicitly there).
The strides topic came up a couple of times recently though, and we should think about this more:
- as_strided: RFC: SciPy array types & libraries support scipy/scipy#18286 (comment)

Previous comments
I'll copy some of the key comments below to have them in a single place:
From @jpivarski: The thing I was most surprised about in the read-through was that "strides" are not considered part of the array object. I recognize that some libraries (e.g. JAX) are able to do great things by not allowing strides to be free. But then, you show in the SciPy benchmark (Fig. 2c, 2d) that not having this handle implies a severe performance penalty, which is likely to be a major take-away from this paper by most readers. In the discussion, you say that it illustrates that users will need to apply library-specific optimizations after all. (Implementing a backend or runtime switching system is non-goal number 2.) It seems to me like strides-awareness would be a good candidate for one of the optional extensions, if extensions are not strictly sets of functions. Perhaps that's a question or suggestion I should take up with the Consortium's normal channels, not a review of the paper, but it's a question the paper leads me to have.
From @leofang: Regarding strides: I have conflicting views and will continue to self-debate. On the one hand, when over half of the array libraries (JAX, TF, Dask, cuNumeric) do not offer strides for various reasons (e.g. they make no sense for distributed arrays), it's my opinion that even making them optional is not the right way to go, just like many other design considerations that we've made. Also, for those supporting the stride semantics, their strides are accessible via DLPack; it's just that currently it's hard to access them from within pure Python.

On the other hand, as a low-level C++ library developer (in short, we write backends for the array libraries), I do see the pain point: it's hard for us to even imagine how our library would be adopted and consumed by the array libraries that do not have strides. It could be that our library simply cannot serve as the battery behind them due to different design philosophies, I don't know. Though, to be fair, the standard aims at the array libraries, not their backends, so if there's a standardization effort targeting array backends, that might be a better place for standardizing strides.
From @tylerjereddy: scipy.signal.welch seemed to be a well-behaved target for prototyping based on that, so I tried doing that from scratch on a feature branch locally, progressively allowing more welch() tests to accept CuPy arrays when an env variable is set.

Even for this "well-behaved"/reasonable target, I ran into problems pretty early on. For example, both in the original feature branch and in my branch, there doesn't seem to be an elegant solution for handling numpy.lib.stride_tricks.as_strided. The docs for that function don't even recommend using it, and yet CuPy (and apparently torch, from the Quansight example) do provide it, outside of the array API standard proper.

So, I guess my first real-world experience makes me wonder what our policy on special-casing in these scenarios will be. Ideally, I'd like to just remove the usage of as_strided() and substitute some other approach that doesn't require conditionals on the exact array type/namespace. While this is a rather specific blocker, if I encounter something like this even for a "well-behaved" case, I can imagine quite a few headaches for the cases that are not well behaved.
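To make concrete what that kind of as_strided usage does, here is a minimal sketch (illustrative names and values, not SciPy's actual code) of viewing a 1-D signal as overlapping segments without copying any data:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(16, dtype=np.float64)
nperseg, step = 8, 4  # segment length 8, hop 4 (50% overlap)
nseg = (x.shape[-1] - nperseg) // step + 1

# Each row starts `step` elements after the previous one; within a row,
# elements are contiguous. No data is copied: the view aliases x.
segments = as_strided(
    x,
    shape=(nseg, nperseg),
    strides=(step * x.strides[-1], x.strides[-1]),
)
```

The memory saving comes precisely from the aliasing, which is also why it is hard to express portably: a library without the strided-view memory model has no equivalent short of materializing the overlapping copies.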
From @rgommers: Let's have a look at this case - this is the code in question:

The as_strided usage is there to save memory. It's actually pretty difficult to understand what the code does exactly, though, so it's good for performance but not great for maintainability. <.... example with more details cut, see here> At that point, we can say we're happy with more readability at the cost of some performance (TBD if the for-loop matters; my guess is it will). Or we just keep the special-casing for NumPy using as_strided. At that point, we do have two code paths, but no performance loss and also easier-to-understand code.

To discuss
It's clear that strides cannot be universally supported, and equally clear that users who write algorithmic code may need to access strides in some cases to avoid significant performance loss.
I think, essentially, dealing with strides by hand for libraries that allow it is doing the work that, in a more full-featured array library, will/should be done by a compiler (e.g. the JIT/AOT compilers in JAX or PyTorch).
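One way to sketch the two-code-paths strategy discussed above: a zero-copy strided view for NumPy, and a portable (copying) fallback for any other array API namespace. The helper name and dispatch check are hypothetical, not SciPy's actual code:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def overlapping_segments(x, nperseg, step, xp):
    """Segment a 1-D array into overlapping windows of length nperseg."""
    nseg = (x.shape[-1] - nperseg) // step + 1
    if xp is np:
        # Fast path: zero-copy strided view (NumPy-specific).
        return as_strided(
            x,
            shape=(nseg, nperseg),
            strides=(step * x.strides[-1], x.strides[-1]),
        )
    # Portable path: standard indexing + stack; allocates a copy,
    # but uses only array API functionality.
    return xp.stack([x[i * step : i * step + nperseg] for i in range(nseg)])
```

Both paths return the same values; only the memory behavior differs, which is exactly the performance-vs-portability trade-off under discussion.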
We need to at least suggest and document some strategy for cases like the as_strided usage above. In the similar but more specific case of order=, which mattered quite a bit performance-wise for scikit-learn, they ended up writing a library-specific utility function to use the order keyword when present, and avoid it otherwise: https://github.com/scikit-learn/scikit-learn/blob/21312644df0a6b4c6f3c27a74ac9d26cf49c2304/sklearn/utils/_array_api.py#L360
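For reference, a hedged sketch of a utility in that spirit (hypothetical function name and dispatch; see the linked scikit-learn file for the real implementation): forward the NumPy-specific order= memory-layout hint only when the namespace supports it.

```python
import numpy as np

def asarray_with_order(array, order="C", xp=np):
    """Convert to an array, honoring `order=` only where it exists."""
    if xp is np:
        # NumPy understands the order= memory-layout keyword.
        return np.asarray(array, order=order)
    # Other array API namespaces offer no memory-order control;
    # fall back to a plain conversion.
    return xp.asarray(array)
```

Callers get the performance benefit of a favorable layout on NumPy without making order= part of the portable code path.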