Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use faster yaml loader and better delayed loading #16

Merged
merged 1 commit into from
Sep 18, 2024

Conversation

toloudis
Copy link
Contributor

@toloudis toloudis commented Sep 14, 2024

Link to Relevant Issue

This pull request resolves #15

Description of Changes

  1. YAML loading is VERY SLOW. Use faster yaml loader.
  2. The way we constructed the delayed load array was also very (unusably) slow. Improve use of delayed loading.

After these changes, most of the time spent is in yaml parsing.

My scratch test script looks like this:

from bioio_base.types import PhysicalPixelSizes
from bioio import BioImage
from bioio.plugins import dump_plugins
from bioio.writers.ome_tiff_writer import OmeTiffWriter
import bioio_sldy
import dask
import numpy
import os
from pathlib import Path

filepath = Path(
    "\\\\allen\\aics\\assay-dev\\MicroscopyData\\John Paul\\2024\\20240305\\20240305_T01_001.sldy"
)
img = BioImage(filepath, reader=bioio_sldy.Reader)
print(img.shape)
print(img.dims)
# now let's extract one timepoint
s = img.get_image_dask_data("CZYX", T=7)
print(s.shape)
t = s.compute()
print(t.shape)

Copy link

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding both the faster YAML load + delayed reads.

I am fine with the new attributes for storing the dims but I think long term we may want to simply add another dims container that stores img.dims_from_meta using the same Dimensions object.

Comment on lines +153 to +160
value = dask.delayed(image.get_data)(
timepoint=timepoint, channel=channel, delayed=True
)
data = da.from_delayed(
value,
shape=(image.sizeZ, image.sizeY, image.sizeX),
dtype=image.dtype,
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were previously loading the whole file one timepoint+channel pair at a time??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YES 😢 at least it was in memmap mode but still.... the test file is pretty big. Each one of these calls had a noticeable hitch of less than a second but with the delayed code they return basically immediately.

Comment on lines +224 to +230
self.sizeT = self._image_record["CImageRecord70"]["mNumTimepoints"]
self.sizeC = self._image_record["CImageRecord70"]["mNumChannels"]
self.sizeZ = self._image_record["CImageRecord70"]["mNumPlanes"]
self.sizeY = self._image_record["CImageRecord70"]["mHeight"]
self.sizeX = self._image_record["CImageRecord70"]["mWidth"]
# TODO check this but are all sldys uint16? bioformats seems to say so
self.dtype = np.dtype(np.uint16)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick check: have we found any files that have pixel data <-> metadata conflicts?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you are putting these values into their own attributes because we don't have a system in place for storing pre-read / pre-load dims?

Copy link
Contributor Author

@toloudis toloudis Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I made a ticket in bioio-base about this. Even if the xarray delayed setup is "fast", it would be still faster to return dims from metadata.

We had czi files that had pixel data size different from metadata size but as I recall it was generally because of unexpected termination of the image capture.
I don't know if this happens with other formats.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just move to standardize this sooner rather than later and create a dims_from_meta attribute or something that Readers can optionally impl.

The docs for the attr should just mention that we have no guarantee that the dims are correct due to "early stopping of image acquisition"

@toloudis toloudis merged commit aa6a795 into main Sep 18, 2024
13 checks passed
@toloudis toloudis deleted the fix/optimize-metadata branch September 18, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to get shape, dimensions, or image dask data from very large SLDY files
3 participants