IO: treat arrays with an empty shape like scalars when writing #1783
Conversation
h5py treats arrays with empty shapes as scalars and raises an exception if `dataset_kwargs` has options incompatible with scalars. So let's treat these arrays like scalars ourselves and filter out incompatible options.
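A minimal sketch of the approach, assuming a hypothetical helper name and an option list taken from the TypeError h5py raises (see the traceback further down); this is illustrative, not the PR's actual code:

```python
import numpy as np

# Options h5py rejects for scalar (shape == ()) datasets,
# per the TypeError raised in h5py's filters.fill_dcpl.
SCALAR_INCOMPATIBLE = frozenset({
    "chunks", "compression", "compression_opts",
    "shuffle", "fletcher32", "scaleoffset", "maxshape",
})

def filter_dataset_kwargs(elem, dataset_kwargs):
    """Drop chunk/filter options when elem will be stored as a scalar."""
    if isinstance(elem, np.ndarray) and elem.shape == ():
        return {
            k: v for k, v in dataset_kwargs.items()
            if k not in SCALAR_INCOMPATIBLE
        }
    return dict(dataset_kwargs)
```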
I'm not sure what's going on with the tests. I've had the exact same tests fail locally even on unmodified master, but then they just started working. In any case, the failing tests seem unrelated to the changes in this PR.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1783      +/-   ##
==========================================
- Coverage   87.01%   84.55%    -2.47%
==========================================
  Files          40       40
  Lines        6075     6085      +10
==========================================
- Hits         5286     5145     -141
- Misses        789      940     +151
```
If this statement is true, why isn't the data stored as such, i.e., why do we need to extract it? Could you make a reproducer so I can understand the issue better, in case it's not obvious?
This is an issue when storing some results in AnnData, or when using `write_elem` directly. In these cases, the dimensionality of the resulting tensor/array may depend on user input, so either all users of `write_elem` have to do the check themselves, or it is done once inside anndata.
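To make the dimensionality point concrete, here is a hypothetical illustration (the `reduce_scores` helper is invented for this example, not part of any library):

```python
import numpy as np

def reduce_scores(x, axis=None):
    # A full reduction yields a 0-d array; reducing along a single
    # axis keeps the remaining dimensions.
    return np.asarray(x.sum(axis=axis))

reduce_scores(np.ones((3, 4))).shape          # () -> h5py stores a scalar
reduce_scores(np.ones((3, 4)), axis=0).shape  # (4,) -> a regular dataset
```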
@ilia-kats I am trying to understand a bit deeper why this change is required, especially:

> so either all users of `write_elem` have to do the check themselves

To my knowledge, we handle scalars, and we handle arrays. The following does not error out:

```python
import numpy as np
import anndata as ad
import h5py

f = h5py.File("tmp.h5", "r+")
ad.io.write_elem(f, 'foo', np.array(0))
assert np.array(0).shape == ()
```

And from what I understand, scalars and numpy arrays act very similarly (i.e., both have `shape == ()`). Could you elaborate a bit on why this change is necessary then, i.e., "so either all users of `write_elem` have to do the check themselves" - why do people have to do this check?
Your example only works with no `dataset_kwargs`:

```
In [1]: import anndata as ad
In [2]: import h5py
In [3]: import numpy as np
In [4]: f = h5py.File("test.h5", "w")
In [5]: ad.io.write_elem(f, "foo", np.array(0), dataset_kwargs={"compression": "gzip"})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 ad.io.write_elem(f, "foo", np.array(0), dataset_kwargs={"compression": "gzip"})
File /data/ilia/envs/famo/lib/python3.11/site-packages/anndata/_io/specs/registry.py:488, in write_elem(store, k, elem, dataset_kwargs)
464 def write_elem(
465 store: GroupStorageType,
466 k: str,
(...)
469 dataset_kwargs: Mapping[str, Any] = MappingProxyType({}),
470 ) -> None:
471 """
472 Write an element to a storage group using anndata encoding.
473
(...)
486 E.g. for zarr this would be `chunks`, `compressor`.
487 """
--> 488 Writer(_REGISTRY).write_elem(store, k, elem, dataset_kwargs=dataset_kwargs)
File /data/ilia/envs/famo/lib/python3.11/site-packages/anndata/_io/utils.py:248, in report_write_key_on_error.<locals>.func_wrapper(*args, **kwargs)
246 raise ValueError("No element found in args.")
247 try:
--> 248 return func(*args, **kwargs)
249 except Exception as e:
250 path = _get_display_path(store)
File /data/ilia/envs/famo/lib/python3.11/site-packages/anndata/_io/specs/registry.py:355, in Writer.write_elem(self, store, k, elem, dataset_kwargs, modifiers)
352 write_func = self.find_write_func(dest_type, elem, modifiers)
354 if self.callback is None:
--> 355 return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
356 return self.callback(
357 write_func,
358 store,
(...)
362 iospec=self.registry.get_spec(elem),
363 )
File /data/ilia/envs/famo/lib/python3.11/site-packages/anndata/_io/specs/registry.py:71, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
69 @wraps(func)
70 def wrapper(g: GroupStorageType, k: str, *args, **kwargs):
---> 71 result = func(g, k, *args, **kwargs)
72 g[k].attrs.setdefault("encoding-type", spec.encoding_type)
73 g[k].attrs.setdefault("encoding-version", spec.encoding_version)
File /data/ilia/envs/famo/lib/python3.11/site-packages/anndata/_io/specs/methods.py:394, in write_basic(f, k, elem, _writer, dataset_kwargs)
376 @_REGISTRY.register_write(H5Group, views.ArrayView, IOSpec("array", "0.2.0"))
377 @_REGISTRY.register_write(H5Group, np.ndarray, IOSpec("array", "0.2.0"))
378 @_REGISTRY.register_write(H5Group, h5py.Dataset, IOSpec("array", "0.2.0"))
(...)
391 dataset_kwargs: Mapping[str, Any] = MappingProxyType({}),
392 ):
393 """Write methods which underlying library handles natively."""
--> 394 f.create_dataset(k, data=elem, **dataset_kwargs)
File /data/ilia/envs/famo/lib/python3.11/site-packages/h5py/_hl/group.py:183, in Group.create_dataset(self, name, shape, dtype, data, **kwds)
180 parent_path, name = name.rsplit(b'/', 1)
181 group = self.require_group(parent_path)
--> 183 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
184 dset = dataset.Dataset(dsid)
185 return dset
File /data/ilia/envs/famo/lib/python3.11/site-packages/h5py/_hl/dataset.py:105, in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, dapl, efile_prefix, virtual_prefix, allow_unknown_filter, rdcc_nslots, rdcc_nbytes, rdcc_w0, fill_time)
103 compression_opts = compression
104 compression = 'gzip'
--> 105 dcpl = filters.fill_dcpl(
106 dcpl or h5p.create(h5p.DATASET_CREATE), shape, dtype,
107 chunks, compression, compression_opts, shuffle, fletcher32,
108 maxshape, scaleoffset, external, allow_unknown_filter,
109 fill_time=fill_time)
111 if fillvalue is not None:
112 # prepare string-type dtypes for fillvalue
113 string_info = h5t.check_string_dtype(dtype)
File /data/ilia/envs/famo/lib/python3.11/site-packages/h5py/_hl/filters.py:163, in fill_dcpl(plist, shape, dtype, chunks, compression, compression_opts, shuffle, fletcher32, maxshape, scaleoffset, external, allow_unknown_filter, fill_time)
160 shapetype = 'Empty' if shape is None else 'Scalar'
161 if any((chunks, compression, compression_opts, shuffle, fletcher32,
162 scaleoffset is not None)):
--> 163 raise TypeError(
164 f"{shapetype} datasets don't support chunk/filter options"
165 )
166 if maxshape and maxshape != ():
167 raise TypeError(f"{shapetype} datasets cannot be extended")
TypeError: Scalar datasets don't support chunk/filter options
Error raised while writing key 'foo' of <class 'h5py._hl.files.File'> to /
```
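Setting anndata aside, the restriction is h5py's own and easy to reproduce directly; a self-contained sketch:

```python
import h5py
import numpy as np

with h5py.File("scalar_demo.h5", "w") as f:
    f.create_dataset("ok", data=np.array(0))  # 0-d data -> scalar dataset, fine
    try:
        # Any chunk/filter option on a scalar dataset raises TypeError
        f.create_dataset("bad", data=np.array(0), compression="gzip")
    except TypeError as err:
        print(err)  # Scalar datasets don't support chunk/filter options
```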
Can you add a towncrier fragment for this please as well? Thanks!
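For context, a towncrier fragment is a small file in the repo's changelog directory whose name encodes the PR number and change type; the directory and naming below are assumptions based on common towncrier setups, not necessarily this project's configuration:

```python
from pathlib import Path

# Hypothetical location and naming; check pyproject.toml's
# [tool.towncrier] section for the project's real convention.
fragment = Path("docs/release-notes") / "1783.bugfix.md"
fragment.parent.mkdir(parents=True, exist_ok=True)
fragment.write_text(
    "Treat 0-d arrays like scalars when writing, filtering out "
    "`dataset_kwargs` options that h5py rejects for scalar datasets.\n"
)
```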
IO: treat arrays with an empty shape like scalars when writing
IO: treat arrays with an empty shape like scalars when writing (#1796) Co-authored-by: ilia-kats <[email protected]>