-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use faster yaml loader and better delayed loading #16
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding both the faster YAML load + delayed reads.
I am fine with the new attributes for storing the dims but I think long term we may want to simply add another dims container that stores img.dims_from_meta
using the same Dimensions
object.
value = dask.delayed(image.get_data)( | ||
timepoint=timepoint, channel=channel, delayed=True | ||
) | ||
data = da.from_delayed( | ||
value, | ||
shape=(image.sizeZ, image.sizeY, image.sizeX), | ||
dtype=image.dtype, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were previously loading the whole file one timepoint+channel pair at a time??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YES 😢 at least it was in memmap mode but still.... the test file is pretty big. Each one of these calls had a noticeable hitch of less than a second but with the delayed code they return basically immediately.
self.sizeT = self._image_record["CImageRecord70"]["mNumTimepoints"] | ||
self.sizeC = self._image_record["CImageRecord70"]["mNumChannels"] | ||
self.sizeZ = self._image_record["CImageRecord70"]["mNumPlanes"] | ||
self.sizeY = self._image_record["CImageRecord70"]["mHeight"] | ||
self.sizeX = self._image_record["CImageRecord70"]["mWidth"] | ||
# TODO check this but are all sldys uint16? bioformats seems to say so | ||
self.dtype = np.dtype(np.uint16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick check: have we found any files that have pixel data <-> metadata conflicts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you are putting these values into their own attributes because we don't have a system in place for storing pre-read / pre-load dims?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I made a ticket in bioio-base about this. Even if the xarray delayed setup is "fast", it would be still faster to return dims from metadata.
We had czi files that had pixel data size different from metadata size but as I recall it was generally because of unexpected termination of the image capture.
I don't know if this happens with other formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just move to standardize this sooner rather than later and create a dims_from_meta
attribute or something that Readers can optionally impl.
The docs for the attr should just mention that we have no guarantee that the dims are correct due to "early stopping of image acquisition"
Link to Relevant Issue
This pull request resolves #15
Description of Changes
After these changes, most of the time spent is in yaml parsing.
My scratch test script looks like this: