Implements find_objects #97

Closed
wants to merge 5 commits

Conversation

jakirkham
Member

Fixes #96

Provides a very basic implementation of find_objects for a label image in a Dask Array.

Provides an implementation of `find_objects` that determines bounding
boxes for each label in the image. Currently requires the user to
specify the number of labels they would like to inspect. Raises a
`NotImplementedError` if the user wishes to collect all bounding boxes
for labels. Works by selecting the 1-D positions that correspond to the
label while ignoring all other points. Assumes that these positions
along with an intermediate array of the same size comfortably fit in
memory.

Within a utility function, determines whether any positions were found
for the corresponding label. If not, simply returns `None`. If positions
were found, it manually unravels the positions and finds the maximum and
minimum positions along each dimension. These are stored into `slice`s,
which are packed into a `tuple` and returned. Makes sure to use in-place
NumPy operations to avoid using additional memory.
Tests `find_objects` with input data that includes labels contained within a single chunk, labels that span multiple chunks, and labels that are not present. Compares the results of the dask-image function against those of the SciPy function to ensure they match.
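
For reference, a minimal sketch of the kind of comparison described (assuming this PR's `find_objects`, which returns one delayed bounding box per requested label; the test data here is illustrative only):

import numpy as np
import dask.array as da
import scipy.ndimage as spnd

labels = np.zeros((10, 12), dtype=int)
labels[1:3, 1:3] = 1    # label contained within a single chunk
labels[4:8, 5:9] = 2    # label spanning chunk boundaries
d_labels = da.from_array(labels, chunks=(5, 6))

expected = spnd.find_objects(labels, max_label=3)   # label 3 absent -> None
result = [r.compute() for r in find_objects(d_labels, 3)]
assert result == expected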
@jni
Contributor

jni commented Feb 4, 2019

"Fixes" is a strong word here. =P

@jakirkham force-pushed the add_find_objects branch 4 times, most recently from 62475a5 to 659a04d on February 10, 2019 at 21:45
@jakirkham force-pushed the add_find_objects branch 2 times, most recently from d434f70 to 7ea6c45 on February 10, 2019 at 22:32
@jakirkham mentioned this pull request on Feb 11, 2019
@GenevieveBuckley
Collaborator

Can I get a quick summary of why this branch is stalled? It also would be good to know what you see as needing to happen from here on out.

I've just been talking with someone about a use case relating to this PR, so I'd like to see if there's any way we can move it forward.

@nayyarv

nayyarv commented Aug 5, 2019

Hi,

I showed up at the PyCon sprint offering some help and @GenevieveBuckley suggested I look into this.

I don't have a good sense of the limits of Dask and what would be considered idiomatic Dask, so I'm putting some info out, in case it's useful.

What I Tried

import numpy as np
import dask.array as da

n_l = np.random.randint(0, 2, size=(50, 60))
n_l[0, :] = 0      # zero out the first row
n_l[:, -1] = 0     # zero out the last column
d_l = da.from_array(n_l, chunks=(10, 12))

Using the code in this PR, I get this output:

[j.compute() for j in find_objects(d_l, 2)]
# Output
[(slice(0, 41, None), slice(0, 59, None)), None]

However, the scipy implementation gives something different (which is what I had expected for the given array)

import scipy.ndimage as spnd

spnd.find_objects(n_l, 2)
# Output
[(slice(1, 50, None), slice(0, 59, None)), None]
  • (N, N) matrices, for example, always give slice(0, 1) on the first index.
  • The correctness for shape (10, 11) in the tests appears to be a quirk of modulo arithmetic (changing the test to use shape (10, 20), for example, makes it fail).

Why this happens

This happens because the implementation of `_utils._find_object` is not correct: to recover the rows and columns it uses modulo and subtraction, when it needs integer division.

Raveling works (for 2-D) via `index = ncols * r_i + c_i`. Hence `index % ncols == c_i`, but once we are left with `ncols * r_i` after the subtraction, no modulo operation will give us `r_i`; we have to divide. For 2-D, `r_i, c_i = divmod(index, ncols)`, but more generally we should go dimension by dimension in order, dividing to get the major index and then removing it via subtraction or modulo. `np.unravel_index(index, shape)` does this for us.
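
A quick numeric illustration (the numbers here are arbitrary):

import numpy as np

shape = (50, 60)                               # (nrows, ncols)
index = np.ravel_multi_index((7, 13), shape)   # 7 * 60 + 13 == 433

assert index % shape[1] == 13          # modulo recovers the column, c_i
assert index // shape[1] == 7          # only division recovers the row, r_i
assert divmod(index, shape[1]) == (7, 13)
assert np.unravel_index(index, shape) == (7, 13)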

Proposed Fix

This works:

import dask
import numpy as np

@dask.delayed
def _find_object_new(pos, shape):
    if pos.size == 0:
        return None  # label not present anywhere
    # Recover per-dimension coordinates from the flat indices
    locs = np.unravel_index(pos, shape)
    return tuple(slice(l.min(), l.max() + 1) for l in locs)

With a differently designed top-level function, we could even use np.where, though I'm not sure whether this has any drawbacks:

@dask.delayed
def _find_object_new(arr, label):
    # Coordinates of every pixel carrying this label, one array per dimension
    locs = np.where(arr == label)
    if locs[0].size == 0:
        return None  # label not present anywhere
    return tuple(slice(l.min(), l.max() + 1) for l in locs)
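
For what it's worth, a top-level wrapper around this second variant might look like the following rough sketch (find_objects_sketch is just an illustrative name, not the PR's API):

import dask

def find_objects_sketch(image, max_label):
    # One delayed bounding-box computation per label; note that each
    # delayed task receives the whole image, so the full array gets
    # materialized in memory when computing (the memory concern raised
    # later in this thread).
    boxes = [_find_object_new(image, lbl) for lbl in range(1, max_label + 1)]
    return list(dask.compute(*boxes))

# e.g. find_objects_sketch(d_l, 2) should then match spnd.find_objects(n_l, 2)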

Hope this helps, I can definitely make a PR or look into more implementation details!

@nayyarv

nayyarv commented Aug 5, 2019

On another note, is this implementation a bit memory-inefficient? `positions = _utils._ravel_shape_indices(...)` and `positions[input == i]` would create largish intermediate arrays of roughly the size of `input`, which seems a bit unnecessary (though I'm not sure about Dask's lazy evaluation).

Algorithmically speaking, merging the slices returned by scipy's find_objects on each chunk should be easy, akin to the merge step of mergesort, or even done incrementally as we go across the subarrays. It might be more memory-efficient (if a little more compute-heavy), though I have no idea how this would work in Dask's processing model.
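
Something like this rough sketch, assuming both boxes are already expressed in global coordinates (merge_boxes is a name of my own invention):

def merge_boxes(a, b):
    # Merge two bounding boxes (tuples of slices) for the same label;
    # either may be None, meaning the label was absent on that side.
    if a is None:
        return b
    if b is None:
        return a
    return tuple(
        slice(min(sa.start, sb.start), max(sa.stop, sb.stop))
        for sa, sb in zip(a, b)
    )

# merge_boxes((slice(1, 5), slice(0, 3)), (slice(4, 9), slice(2, 6)))
# -> (slice(1, 9), slice(0, 6))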

@GenevieveBuckley
Collaborator

Thanks for investigating how we might get this PR unstuck @nayyarv
It'll be really valuable to have this functionality.

@jakirkham, can I invite you to check whether these comments are useful, or whether I've confused anyone with my interpretation of what's happening here?

I think @nayyarv also mentioned that he didn't run into any issues with dask-delayed (which we thought might be more of a problem), so that's possibly helpful to know.

@jakirkham
Member Author

Thanks for the feedback. Glad to know this is still of interest to people.

I think the gist of why this is stalled is we don't want to use delayed. There may have been some other issues that I'm not recalling ATM. Will try to play with this again and let you know if anything else surfaces.

@GenevieveBuckley
Collaborator

I think the gist of why this is stalled is we don't want to use delayed. There may have been some other issues that I'm not recalling ATM.

So it's more a philosophical than a technical limitation? As in: OK, we can do this with delayed, but should we? I think that makes things a bit clearer for me.

Glad to know this is still of interest to people.

There's definitely interest. Mostly the interest is in how to build a pipeline and get to regionprops. As I understand things, @jni seems to think this is probably the shortest path to doing that quickly.

@jakirkham
Member Author

We will likely run into memory issues with any reasonably large image using this approach with delayed.

@jni
Contributor

jni commented Aug 5, 2019

We will likely run into memory issues with any reasonably large image using this approach with delayed

@jakirkham can you elaborate on this? As far as I can tell, the algorithm is reasonably simple:

  • For each chunk, find the min and max extent of every label, but convert the resulting list to a dictionary mapping labels to extents.
  • Write a function that combines two such dictionaries.
  • Finally, find the max key of the combined dictionary and put the results in a list.

As usual, I'm not clear in my head about where to make things delayed and where not, but nothing here should require enormous memory, whether or not we compute eagerly. It would be preferable not to compute eagerly because it might be expensive to instantiate a block, but honestly we could just call compute() on the map_blocks and be done. As far as I can tell, that wouldn't blow up the memory? (Again, we are assuming that nlabels << npixels. I don't think there's a way around that.)
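
To make the second and third bullets concrete, here's a sketch under the assumption that each chunk yields a dictionary mapping labels to tuples of slices in global coordinates (both function names are mine):

def combine_extents(d1, d2):
    # Combine two {label: tuple-of-slices} dictionaries, taking the
    # union of extents for labels that appear in both.
    out = dict(d1)
    for label, box in d2.items():
        if label in out:
            out[label] = tuple(
                slice(min(a.start, b.start), max(a.stop, b.stop))
                for a, b in zip(out[label], box)
            )
        else:
            out[label] = box
    return out

def extents_to_list(extents):
    # Final step: the max label fixes the list length; absent labels -> None.
    n = max(extents) if extents else 0
    return [extents.get(lbl) for lbl in range(1, n + 1)]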

@nayyarv

nayyarv commented Aug 6, 2019

Again chiming in here (with hopefully useful info)

@jni I think you and I are coming from the same place: conceptually, merging the spnd.find_objects outputs is quite straightforward and should be simple. In fact, spnd.find_objects already behaves in a dict-like way, since each index represents the slice for that label (i.e. index 0 for label 1, index 3 for label 4), and we could just compare the starts and stops of each slice (per dimension) and keep the smallest and largest respectively.
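
To illustrate the dict-like behaviour (a small example of my own):

import numpy as np
import scipy.ndimage as spnd

labels = np.array([[1, 1, 0, 4],
                   [0, 0, 0, 4]])
spnd.find_objects(labels)
# [(slice(0, 1, None), slice(0, 2, None)), None, None,
#  (slice(0, 2, None), slice(3, 4, None))]
# i.e. index 0 -> label 1, indices 1 and 2 -> absent labels 2 and 3,
# index 3 -> label 4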

What this PR does

This PR uses Dask to do the computations. We take the indices of the label we're interested in (via `positions[input == i]`) and then unravel them so we can find the min and max indices across the dimensions. This should let the underlying Dask machinery keep memory costs to a minimum, but I can't imagine it's super efficient (the positions array is used multiple times, for example, and the unraveling means we're not using our memory reads well).

Reduction Approach

Computing find_objects on each chunk and merging.

I'm finding this quite tricky to implement as a Dask reduction (I suspect @jakirkham might have already discovered this). Dask reductions don't pass index information along, which means we don't have the index info needed to merge. (Conceptually, we get chunk a and chunk b, run find_objects, and get back a bunch of slices; if we don't know where these chunks sit in the array, we can't merge them.)

This points to an arg_reduction-style approach, which is a lot hackier and may introduce more technical debt than I'm comfortable with. Not to mention, I haven't managed to get it to work yet.

Thoughts

Put a regionprops implementation on top of this and see whether the performance of this PR is acceptable. If not, then maybe we need to bite the bullet on the reduction approach.

@jni
Contributor

jni commented Aug 6, 2019

@nayyarv I disagree about reduction being the hacky way to do it. As I understand, you can pass multiple iterators to map_blocks (?), which means that you can pass an iterator of block coordinates to it, and write your function to take in a block and a block offset.

... Actually it's even easier, it's already built in. From the API docs:

Your block function gets information about where it is in the array by accepting a special block_info keyword argument.

  >>> def func(block, block_info=None):
  ...     pass

This will receive the following information:

>>> block_info  # doctest: +SKIP
{0: {'shape': (1000,),
     'num-chunks': (10,),
     'chunk-location': (4,),
     'array-location': [(400, 500)]},
None: {'shape': (1000,),
       'num-chunks': (10,),
       'chunk-location': (4,),
       'array-location': [(400, 500)],
       'chunk-shape': (100,),
       'dtype': dtype('float64')}}
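
Putting those pieces together, a rough, untested sketch of a block function that uses block_info to shift SciPy's chunk-local slices into global coordinates (how the per-chunk dictionaries get back out of map_blocks, e.g. via object arrays or dask.delayed, is glossed over here):

import scipy.ndimage as spnd

def _chunk_boxes(block, block_info=None):
    # 'array-location' holds this chunk's (start, stop) along each axis
    offsets = [start for start, _ in block_info[0]["array-location"]]
    boxes = {}
    for idx, box in enumerate(spnd.find_objects(block)):
        if box is None:
            continue  # this label does not occur in this chunk
        # Shift the chunk-local slices by the chunk's offset in the full array
        boxes[idx + 1] = tuple(
            slice(s.start + off, s.stop + off)
            for s, off in zip(box, offsets)
        )
    return boxes

# The resulting per-chunk dictionaries could then be folded pairwise with a
# combine function like the one sketched earlier in this thread.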

@GenevieveBuckley
Collaborator

Closed in favour of #240
