Implements find_objects #97

Closed
wants to merge 5 commits

Conversation

jakirkham
Member

Fixes #96

Provides a very basic implementation of find_objects for a label image in a Dask Array.

Provides an implementation of `find_objects` that determines bounding
boxes for each label in the image. Currently requires the user to
specify the number of labels they would like to inspect. Raises a
`NotImplementedError` if the user wishes to collect all bounding boxes
for labels. Works by selecting the 1-D positions that correspond to the
label while ignoring all other points. Assumes that these positions
along with an intermediate array of the same size comfortably fit in
memory.

Within a utility function, determines whether any positions were found
for the corresponding label. If not, simply returns `None`. If positions
were found, it manually unravels the positions and finds the maximum and
minimum positions along each dimension. These are stored into `slice`s,
which are packed into a `tuple` and returned. Makes sure to use in-place
NumPy operations to avoid using additional memory.
Tests `find_objects` with input data that includes labels contained within a single chunk, labels that span multiple chunks, and labels that are not present. Compares the results of the dask-image function against those of the SciPy function to ensure they match.
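
For reference, a minimal sketch of the kind of comparison described (assuming this PR's `find_objects`, which returns one delayed bounding box per requested label; the test data here is illustrative only):

import numpy as np
import dask.array as da
import scipy.ndimage as spnd

labels = np.zeros((10, 12), dtype=int)
labels[1:3, 1:3] = 1    # label contained within a single chunk
labels[4:8, 5:9] = 2    # label spanning chunk boundaries
d_labels = da.from_array(labels, chunks=(5, 6))

expected = spnd.find_objects(labels, max_label=3)   # label 3 absent -> None
result = [r.compute() for r in find_objects(d_labels, 3)]
assert result == expected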
@jni
Contributor

jni commented Feb 4, 2019

"Fixes" is a strong word here. =P

@jakirkham force-pushed the add_find_objects branch 4 times, most recently from 62475a5 to 659a04d on February 10, 2019 at 21:45
@jakirkham force-pushed the add_find_objects branch 2 times, most recently from d434f70 to 7ea6c45 on February 10, 2019 at 22:32
@jakirkham mentioned this pull request on Feb 11, 2019
@GenevieveBuckley
Collaborator

Can I get a quick summary of why this branch is stalled? It also would be good to know what you see as needing to happen from here on out.

I've just been talking with someone about a use case relating to this PR, so I'd like to see if there's any way we can move it forward.

@nayyarv

nayyarv commented Aug 5, 2019

Hi,

I showed up at the PyCon sprint offering some help and @GenevieveBuckley suggested I look into this.

I don't have a good sense of the limits of Dask and what would be considered idiomatic Dask, so I'm putting some info out, in case it's useful.

What I Tried

import numpy as np
import dask.array as da

n_l = np.random.randint(0, 2, size=(50, 60))
n_l[0, :] = 0      # zero out the first row
n_l[:, -1] = 0     # zero out the last column
d_l = da.from_array(n_l, chunks=(10, 12))

Using the code in this PR, I get this output:

[j.compute() for j in find_objects(d_l, 2)]
# Output
[(slice(0, 41, None), slice(0, 59, None)), None]

However, the scipy implementation gives something different (which is what I had expected for the given array)

import scipy.ndimage as spnd

spnd.find_objects(n_l, 2)
# Output
[(slice(1, 50, None), slice(0, 59, None)), None]
  • (N, N) matrices, for example, always give slice(0, 1) on the first index.
  • The correctness for shape (10, 11) in the tests appears to be a quirk of modulo arithmetic (changing the test to use shape (10, 20), for example, makes it fail).

Why this happens

This happens because the implementation of `_utils._find_object` is not correct: to recover the rows and columns it uses modulo and subtraction, when it needs integer division.

Raveling works (for 2-D) via `index = ncols * r_i + c_i`. Hence `index % ncols == c_i`, but once we are left with `ncols * r_i` after the subtraction, no modulo operation will give us `r_i`; we have to divide. For 2-D, `r_i, c_i = divmod(index, ncols)`, but more generally we should go dimension by dimension in order, dividing to get the major index and then removing it via subtraction or modulo. `np.unravel_index(index, shape)` does this for us.
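
A quick numeric illustration (the numbers here are arbitrary):

import numpy as np

shape = (50, 60)                               # (nrows, ncols)
index = np.ravel_multi_index((7, 13), shape)   # 7 * 60 + 13 == 433

assert index % shape[1] == 13          # modulo recovers the column, c_i
assert index // shape[1] == 7          # only division recovers the row, r_i
assert divmod(index, shape[1]) == (7, 13)
assert np.unravel_index(index, shape) == (7, 13)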

Proposed Fix

This works:

import dask
import numpy as np

@dask.delayed
def _find_object_new(pos, shape):
    if pos.size == 0:
        return None  # label not present anywhere
    # Recover per-dimension coordinates from the flat indices
    locs = np.unravel_index(pos, shape)
    return tuple(slice(l.min(), l.max() + 1) for l in locs)

With a differently designed top-level function, we could even use np.where, though I'm not sure whether this has any drawbacks:

@dask.delayed
def _find_object_new(arr, label):
    # Coordinates of every pixel carrying this label, one array per dimension
    locs = np.where(arr == label)
    if locs[0].size == 0:
        return None  # label not present anywhere
    return tuple(slice(l.min(), l.max() + 1) for l in locs)
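
For what it's worth, a top-level wrapper around this second variant might look like the following rough sketch (find_objects_sketch is just an illustrative name, not the PR's API):

import dask

def find_objects_sketch(image, max_label):
    # One delayed bounding-box computation per label; note that each
    # delayed task receives the whole image, so the full array gets
    # materialized in memory when computing (the memory concern raised
    # later in this thread).
    boxes = [_find_object_new(image, lbl) for lbl in range(1, max_label + 1)]
    return list(dask.compute(*boxes))

# e.g. find_objects_sketch(d_l, 2) should then match spnd.find_objects(n_l, 2)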

Hope this helps, I can definitely make a PR or look into more implementation details!

@nayyarv

nayyarv commented Aug 5, 2019

On another note, is this implementation a bit memory-inefficient? `positions = _utils._ravel_shape_indices(...)` and `positions[input == i]` would create largish intermediate arrays of roughly the size of `input`, which seems a bit unnecessary (though I'm not sure about Dask's lazy evaluation).

Algorithmically speaking, merging the slices returned by scipy's find_objects on each chunk should be easy, akin to the merge step of mergesort, or even done incrementally as we go across the subarrays. It might be more memory-efficient (if a little more compute-heavy), though I have no idea how this would work in Dask's processing model.
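
Something like this rough sketch, assuming both boxes are already expressed in global coordinates (merge_boxes is a name of my own invention):

def merge_boxes(a, b):
    # Merge two bounding boxes (tuples of slices) for the same label;
    # either may be None, meaning the label was absent on that side.
    if a is None:
        return b
    if b is None:
        return a
    return tuple(
        slice(min(sa.start, sb.start), max(sa.stop, sb.stop))
        for sa, sb in zip(a, b)
    )

# merge_boxes((slice(1, 5), slice(0, 3)), (slice(4, 9), slice(2, 6)))
# -> (slice(1, 9), slice(0, 6))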

@GenevieveBuckley
Collaborator

Thanks for investigating how we might get this PR unstuck @nayyarv
It'll be really valuable to have this functionality.

@jakirkham, can I invite you to check whether these comments are useful, or whether I've confused anyone with my interpretation of what's happening here?

I think @nayyarv also mentioned that he didn't run into any issues with dask-delayed (which we thought might be more of a problem), so that's possibly helpful to know.

@jakirkham
Member Author

Thanks for the feedback. Glad to know this is still of interest to people.

I think the gist of why this is stalled is we don't want to use delayed. There may have been some other issues that I'm not recalling ATM. Will try to play with this again and let you know if anything else surfaces.

@GenevieveBuckley
Collaborator

I think the gist of why this is stalled is we don't want to use delayed. There may have been some other issues that I'm not recalling ATM.

So it's more a philosophical than a technical limitation? As in: OK, we can do this with delayed, but should we? I think that makes things a bit clearer for me.

Glad to know this is still of interest to people.

There's definitely interest. Mostly the interest is in how to build a pipeline and get to regionprops. As I understand things, @jni seems to think this is probably the shortest path to doing that quickly.

@jakirkham
Member Author

We will likely run into memory issues with any reasonably large image using this approach with delayed.

@jni
Contributor

jni commented Aug 5, 2019

We will likely run into memory issues with any reasonably large image using this approach with delayed

@jakirkham can you elaborate on this? As far as I can tell, the algorithm is reasonably simple:

  • For each chunk, find the min and max extent of every label, but convert the resulting list to a dictionary mapping labels to extents.
  • Write a function that combines two such dictionaries.
  • Finally, find the max key of the combined dictionary and put the results in a list.

As usual, I'm not clear in my head about where to make things delayed and where not, but nothing here should require enormous memory, whether or not we compute eagerly. It would be preferable not to compute eagerly because it might be expensive to instantiate a block, but honestly we could just call compute() on the map_blocks and be done. As far as I can tell, that wouldn't blow up the memory? (Again, we are assuming that nlabels << npixels. I don't think there's a way around that.)
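
To make the second and third bullets concrete, here's a sketch under the assumption that each chunk yields a dictionary mapping labels to tuples of slices in global coordinates (both function names are mine):

def combine_extents(d1, d2):
    # Combine two {label: tuple-of-slices} dictionaries, taking the
    # union of extents for labels that appear in both.
    out = dict(d1)
    for label, box in d2.items():
        if label in out:
            out[label] = tuple(
                slice(min(a.start, b.start), max(a.stop, b.stop))
                for a, b in zip(out[label], box)
            )
        else:
            out[label] = box
    return out

def extents_to_list(extents):
    # Final step: the max label fixes the list length; absent labels -> None.
    n = max(extents) if extents else 0
    return [extents.get(lbl) for lbl in range(1, n + 1)]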

@nayyarv

nayyarv commented Aug 6, 2019

Again chiming in here (with hopefully useful info)

@jni I think you and I are coming from the same place: conceptually, merging the spnd.find_objects outputs is quite straightforward and should be simple. In fact, spnd.find_objects already behaves in a dict-like way, since each index represents the slice for that label (i.e. index 0 for label 1, index 3 for label 4), and we could just compare the starts and stops of each slice (per dimension) and keep the smallest and largest respectively.
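
To illustrate the dict-like behaviour (a small example of my own):

import numpy as np
import scipy.ndimage as spnd

labels = np.array([[1, 1, 0, 4],
                   [0, 0, 0, 4]])
spnd.find_objects(labels)
# [(slice(0, 1, None), slice(0, 2, None)), None, None,
#  (slice(0, 2, None), slice(3, 4, None))]
# i.e. index 0 -> label 1, indices 1 and 2 -> absent labels 2 and 3,
# index 3 -> label 4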

What this PR does

This PR uses Dask to do the computations. We take the indices of the label we're interested in (via `positions[input == i]`) and then unravel them so we can find the min and max indices across the dimensions. This should let the underlying Dask machinery keep memory costs to a minimum, but I can't imagine it's super efficient (the positions array is used multiple times, for example, and the unraveling means we're not using our memory reads well).

Reduction Approach

Computing find_objects on each chunk and merging.

I'm finding this quite tricky to implement as a Dask reduction (I suspect @jakirkham might have already discovered this). Dask reductions don't pass index information along, which means we don't have the index info needed to merge. (Conceptually, we get chunk a and chunk b, run find_objects, and get back a bunch of slices; if we don't know where these chunks sit in the array, we can't merge them.)

This points to an arg_reduction-style approach, which is a lot hackier and may introduce more technical debt than I'm comfortable with. Not to mention, I haven't managed to get it to work yet.

Thoughts

Put a regionprops implementation on top of this and see whether the performance of this PR is acceptable. If not, then maybe we need to bite the bullet on the reduction approach.

@jni
Contributor

jni commented Aug 6, 2019

@nayyarv I disagree about reduction being the hacky way to do it. As I understand, you can pass multiple iterators to map_blocks (?), which means that you can pass an iterator of block coordinates to it, and write your function to take in a block and a block offset.

... Actually it's even easier, it's already built in. From the API docs:

Your block function gets information about where it is in the array by accepting a special block_info keyword argument.

  >>> def func(block, block_info=None):
  ...     pass

This will receive the following information:

>>> block_info  # doctest: +SKIP
{0: {'shape': (1000,),
     'num-chunks': (10,),
     'chunk-location': (4,),
     'array-location': [(400, 500)]},
None: {'shape': (1000,),
       'num-chunks': (10,),
       'chunk-location': (4,),
       'array-location': [(400, 500)],
       'chunk-shape': (100,),
       'dtype': dtype('float64')}}
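
Putting those pieces together, a rough, untested sketch of a block function that uses block_info to shift SciPy's chunk-local slices into global coordinates (how the per-chunk dictionaries get back out of map_blocks, e.g. via object arrays or dask.delayed, is glossed over here):

import scipy.ndimage as spnd

def _chunk_boxes(block, block_info=None):
    # 'array-location' holds this chunk's (start, stop) along each axis
    offsets = [start for start, _ in block_info[0]["array-location"]]
    boxes = {}
    for idx, box in enumerate(spnd.find_objects(block)):
        if box is None:
            continue  # this label does not occur in this chunk
        # Shift the chunk-local slices by the chunk's offset in the full array
        boxes[idx + 1] = tuple(
            slice(s.start + off, s.stop + off)
            for s, off in zip(box, offsets)
        )
    return boxes

# The resulting per-chunk dictionaries could then be folded pairwise with a
# combine function like the one sketched earlier in this thread.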

@GenevieveBuckley
Collaborator

Closed in favour of #240
