Implements find_objects #97
Conversation
Provides an implementation of `find_objects` that determines bounding boxes for each label in the image. Currently requires the user to specify the number of labels they would like to inspect, and raises a `NotImplementedError` if the user wishes to collect bounding boxes for all labels. Works by selecting the 1-D positions that correspond to the label while ignoring all other points. Assumes that these positions, along with an intermediate array of the same size, comfortably fit in memory. Within a utility function, determines whether any positions were found for the corresponding label. If not, simply returns `None`. If positions were found, it manually unravels the positions and finds the maximum and minimum positions along each dimension. These are stored in `slice`s, which are packed into a `tuple` and returned. Uses in-place NumPy operations to avoid allocating additional memory.
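The strategy described above can be sketched in plain NumPy (a minimal, hypothetical sketch, not the PR's actual code; `bounding_box` is an illustrative name):

```python
import numpy as np

def bounding_box(image, label):
    """Return a tuple of slices bounding `label` in `image`, or None."""
    # 1-D positions of the label in the flattened image
    pos = np.flatnonzero(image.ravel() == label)
    if pos.size == 0:
        return None
    # Unravel the flat positions back to N-D coordinates
    coords = np.unravel_index(pos, image.shape)
    # The min/max along each dimension give the bounding box
    return tuple(slice(c.min(), c.max() + 1) for c in coords)

labels = np.zeros((5, 6), dtype=int)
labels[1:3, 2:5] = 1
print(bounding_box(labels, 1))  # -> (slice(1, 3, None), slice(2, 5, None))
print(bounding_box(labels, 2))  # -> None
```

The PR applies the same idea lazily, operating on the flat positions of each requested label rather than the whole image.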
Try testing `find_objects` with some input data that includes labels within a chunk, labels that span across chunks, and labels that are not present. Compare the results of the dask-image function to that of the SciPy function to ensure they match.
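Such test data might look like the following (a sketch; `scipy.ndimage` serves as the reference implementation, and the dask-image call is assumed from this PR's API, so it is left commented out):

```python
import numpy as np
import scipy.ndimage as ndi

# With (4, 4) chunks in mind: label 1 sits inside a single chunk,
# label 2 spans a chunk boundary, and label 3 is absent entirely.
labels = np.zeros((8, 8), dtype=int)
labels[1:3, 1:3] = 1   # fully inside the top-left chunk
labels[3:6, 3:6] = 2   # crosses the boundary at row/column 4

expected = ndi.find_objects(labels, max_label=3)
# expected[0] bounds label 1, expected[1] bounds label 2,
# and expected[2] is None because label 3 never occurs.

# d_labels = da.from_array(labels, chunks=(4, 4))
# result = [r.compute() for r in find_objects(d_labels, 3)]  # PR's API
# assert result == expected
```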
"Fixes" is a strong word here. =P
Can I get a quick summary of why this branch is stalled? It would also be good to know what you see as needing to happen from here on out. I've just been talking with someone about a use case relating to this PR, so I'd like to see if there's any way we can move it forward.
Hi, I showed up at the PyCon sprint offering some help and @GenevieveBuckley suggested I look into this. I don't have a good sense of the limits of Dask and what would be considered idiomatic Dask, so I'm putting some info out in case it's useful.

**What I tried**

```python
n_l = np.random.randint(0, 2, size=(50, 60))
n_l[0, :] = 0
n_l[:, -1] = 0
d_l = da.from_array(n_l, chunks=(10, 12))
```

Using the code in this PR, I get this out:

```python
[j.compute() for j in find_objects(d_l, 2)]
# Output
[(slice(0, 41, None), slice(0, 59, None)), None]
```

However, the SciPy implementation gives something different (which is what I had expected for the given array):

```python
spnd.find_objects(n_l, 2)
# Output
[(slice(1, 50, None), slice(0, 59, None)), None]
```
**Why this happens**

This is incorrect because of how the implementation manually unravels the flat positions: ravel works (for 2d) by mapping `(row, col)` to `row * ncols + col`, so the unravelling has to be done against the full array shape, which `np.unravel_index` takes care of.

**Proposed fix**

This works:

```python
@dask.delayed
def _find_object_new(pos, shape):
    if pos.size == 0:
        return None
    locs = np.unravel_index(pos, shape)
    return tuple(slice(l.min(), l.max() + 1) for l in locs)
```

With a differently designed top-level function, we could even use:

```python
@dask.delayed
def _find_object_new(arr, label):
    locs = np.where(arr == label)
    if locs[0].size == 0:
        return None
    return tuple(slice(l.min(), l.max() + 1) for l in locs)
```

Hope this helps, I can definitely make a PR or look into more implementation details!
On another note, is this implementation a bit memory inefficient? Algorithmically speaking, merging the slices returned by SciPy's `find_objects` […]
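For illustration, merging two bounding boxes for the same label (once both are expressed in global coordinates) only needs a per-dimension union; `merge_boxes` below is a hypothetical helper, not part of the PR:

```python
def merge_boxes(box_a, box_b):
    """Union of two bounding boxes given as tuples of slices (or None)."""
    if box_a is None:
        return box_b
    if box_b is None:
        return box_a
    # Per dimension, take the smallest start and largest stop
    return tuple(
        slice(min(a.start, b.start), max(a.stop, b.stop))
        for a, b in zip(box_a, box_b)
    )

# Two per-chunk boxes for the same label, already in global coordinates
left = (slice(1, 4), slice(0, 3))
right = (slice(2, 6), slice(3, 5))
print(merge_boxes(left, right))  # -> (slice(1, 6, None), slice(0, 5, None))
```

This merge is associative, which is what makes a chunk-wise approach attractive: per-chunk results can be combined in any order.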
Thanks for investigating how we might get this PR unstuck, @nayyarv! @jakirkham, can I invite you to see if these comments are useful, or whether I've confused anyone with my interpretation of what's happening here? I think @nayyarv also mentioned that he didn't run into any issues with `dask.delayed` (which we thought might be more of a problem), so that's possibly helpful to know.
Thanks for the feedback. Glad to know this is still of interest to people. I think the gist of why this is stalled is that we don't want to use `dask.delayed`.
So it's more a philosophical than a technical limitation? As in: OK, we can do this with delayed, but should we? I think this makes things a bit clearer for me.
There's definitely interest. Mostly the interest is in how to build a pipeline and get to regionprops. As I understand things, @jni seems to think this is probably the shortest path for doing that quickly.
We will likely run into memory issues with any reasonably large image using this approach with `dask.delayed`.
@jakirkham can you elaborate on this? As far as I can tell, the algorithm is reasonably simple:

1. run SciPy's `find_objects` on each chunk (e.g. via `map_blocks`);
2. shift each chunk's slices by that chunk's offset in the full array;
3. merge the slices for each label by taking the union of the bounding boxes.

As usual, I'm not clear in my head about where to make things delayed and where not, but nothing here should require enormous memory, whether or not we compute eagerly. It would be preferable not to compute eagerly because it might be expensive to instantiate a block, but honestly we could just call `compute()` on the `map_blocks` result and be done. As far as I can tell that wouldn't blow up the memory? (Again, we are assuming that nlabels << npixels. I don't think there's a way around this.)
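The chunk-wise idea can be sketched without any dask machinery: run SciPy's `find_objects` on each chunk, shift the slices by the chunk's offset, and take the per-label union. This is an illustrative sketch (`find_objects_chunked` is a made-up name; in dask the outer loops would become a `map_blocks` call plus a small merge step):

```python
import numpy as np
import scipy.ndimage as ndi

def find_objects_chunked(labels, chunk_shape, max_label):
    boxes = [None] * max_label
    # Iterate over chunk offsets (dask would do this via map_blocks)
    for i in range(0, labels.shape[0], chunk_shape[0]):
        for j in range(0, labels.shape[1], chunk_shape[1]):
            chunk = labels[i:i + chunk_shape[0], j:j + chunk_shape[1]]
            local = ndi.find_objects(chunk, max_label=max_label)
            for lbl, box in enumerate(local):
                if box is None:
                    continue
                # Shift local slices into global coordinates
                shifted = (slice(box[0].start + i, box[0].stop + i),
                           slice(box[1].start + j, box[1].stop + j))
                old = boxes[lbl]
                if old is None:
                    boxes[lbl] = shifted
                else:
                    # Union of the two bounding boxes, per dimension
                    boxes[lbl] = tuple(
                        slice(min(a.start, b.start), max(a.stop, b.stop))
                        for a, b in zip(old, shifted))
    return boxes

labels = np.zeros((8, 8), dtype=int)
labels[3:6, 3:6] = 1   # spans the (4, 4) chunk boundary
print(find_objects_chunked(labels, (4, 4), 1))
# -> [(slice(3, 6, None), slice(3, 6, None))]
```

Memory use here is O(nlabels × ndim) for the accumulated boxes, consistent with the nlabels << npixels assumption above.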
Again chiming in here (with hopefully useful info). @jni, I think you and I are coming from the same place: conceptually, the merging of the per-chunk slices is the right approach.

**What this PR does**

This PR is using Dask to do the computations. We take the indices of the label we're interested in (in the flattened array) […]

**Reduction approach**

Computing this as a dask reduction: I'm finding it quite tricky to implement (I suspect that @jakirkham might have already discovered this). Dask reductions don't pass index info along, which means we don't have index info to merge. (Conceptually, we get chunk a and chunk b, run the chunk function on each […]) This means […]

**Thoughts**

Put a […]
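As an aside on the missing index info: `map_blocks` (as opposed to a reduction) can hand each block its location in the full array via the documented `block_info` keyword, which is one way to obtain per-chunk offsets for a later merge. A minimal sketch (`chunk_corner` is an illustrative name):

```python
import numpy as np
import dask.array as da

def chunk_corner(block, block_info=None):
    # block_info[0]['array-location'] is a list of (start, stop)
    # pairs locating this block within the full array.
    (r0, _), (c0, _) = block_info[0]['array-location']
    return np.array([[r0, c0]])  # global offset of this chunk

x = da.zeros((8, 8), chunks=(4, 4))
corners = x.map_blocks(chunk_corner, dtype=int, chunks=(1, 2)).compute()
print(corners)
# -> [[0 0 0 4]
#     [4 0 4 4]]
```

With the offsets in hand, per-chunk `find_objects` results could be shifted into global coordinates before merging, sidestepping the reduction limitation described above.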
@nayyarv I disagree about reduction being the hacky way to do it. As I understand it, you can pass multiple iterators to… actually, it's even easier: it's already built in. From the API docs:

> […]
Closed in favour of #240
Fixes #96

Provides a very basic implementation of `find_objects` for a label image in a Dask Array.