
Add binary/opaque dtype #34

Open
rly opened this issue May 11, 2024 · 2 comments
rly (Contributor) commented May 11, 2024

Related to NeurodataWithoutBorders/nwb-schema#574 to allow the storage of raw binary data that follows a particular format, e.g., MP4, PNG.

In the HDMF schema language, dtype "bytes" maps to a variable-length string with ASCII encoding.
In HDMF, if I try to write an MP4 byte stream with dtype "bytes" to an HDF5 file, I get the error ValueError: VLEN strings do not support embedded NULLs.
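For contrast, bytes that contain only ASCII text (no embedded NULs) round-trip fine with this mapping. A minimal sketch (file and dataset names are arbitrary):

import h5py

with h5py.File("text_ok.h5", "w") as f:
    # plain ASCII text stored as a variable-length "bytes" (ascii) string -- no error
    f.create_dataset(name="label", data=b"hello world", dtype=h5py.string_dtype("ascii"))

with h5py.File("text_ok.h5", "r") as f:
    assert f["label"][()] == b"hello world"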

Here is the error with a simple h5py-based example:

import h5py

# video_data is a bytes object holding the raw MP4 stream, e.g. video_data = open("video.mp4", "rb").read()
f = h5py.File("test.h5", "w")
f.create_dataset(name="data", data=video_data, dtype=h5py.string_dtype('ascii'))
# NOTE: h5py.string_dtype('ascii') is equivalent to h5py.special_dtype(vlen=bytes)
# NOTE: f.create_dataset(name="data", data=video_data) also treats the bytes as a string and raises the same error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 166, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 282, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 147, in h5py._proxy.dset_rw
  File "h5py/_conv.pyx", line 442, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 96, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 254, in h5py._conv.conv_str2vlen
ValueError: VLEN strings do not support embedded NULLs

The h5py docs recommend against storing raw binary data as variable-length strings with an encoding. They say:

If you have a non-text blob in a Python byte string (as opposed to ASCII or UTF-8 encoded text, which is fine), you should wrap it in a void type for storage. This will map to the HDF5 OPAQUE datatype, and will prevent your blob from getting mangled by the string machinery.

To enable storage of raw binary data, I propose we add a new dtype to the schema language that maps to the HDF5 OPAQUE / NumPy void dtype. We can't use the dtype name "bytes" because that already maps to ASCII data. What about "binary"?

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.void(video_data))
... 
<HDF5 dataset "data": shape (), type "|V1048061">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][()].tobytes()
...

Alternatively, raw binary data could be stored as a 1-D array of uint8 values, but using dtype uint8, as opposed to OPAQUE, may invite accidental interpretation of the bytes as numeric data.
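A sketch of that uint8 alternative, mirroring the examples in this issue (file name arbitrary; the reported shape assumes the same ~1 MB byte stream used above). Note the dataset type is then |u1 rather than an opaque |V type:

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test_uint8.h5", "w") as f:
...     f.create_dataset(name="data", data=np.frombuffer(video_data, dtype="uint8"))
... 
<HDF5 dataset "data": shape (1048061,), type "|u1">
>>> with h5py.File("test_uint8.h5", "r") as f:
...     data = f["data"][:].tobytes()
... 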

rly added the category: proposal and priority: low labels on May 11, 2024
rly self-assigned this on May 11, 2024
rly (Contributor, Author) commented May 11, 2024

As an HDF5 array of 1-byte void dtypes:

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
... 
<HDF5 dataset "data": shape (1048061,), type "|V1">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][:].tobytes()
... 

As a scalar Zarr array:

>>> import numpy as np
>>> import zarr
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.void(video_data))
<zarr.core.Array '/data' () |V1048061>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][()].tobytes()

As a Zarr array of 1-byte void dtypes:

>>> import numpy as np
>>> import zarr
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
<zarr.core.Array '/data' (1048061,) |V1>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][:].tobytes()

rly (Contributor, Author) commented May 11, 2024

When the data are written as a scalar Zarr array, they are stored in a single chunk, and that chunk is byte-for-byte identical to writing the raw bytes to disk. For some reason, though, the fill value is set to a long repeated "AAAA..." string, which makes the .zarray metadata file larger than the chunk itself.
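That fill value looks like the base64 encoding of an all-zero element of the scalar V dtype. If that is indeed the cause, one possible workaround (a sketch, not tested here) is to pass fill_value=None so that .zarray records null instead of the encoded default:

>>> import numpy as np
>>> import zarr
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.void(video_data), fill_value=None)
<zarr.core.Array '/data' () |V1048061>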

Alternatively, as shown above, we could store the bytes in a dataset with shape (N,) and dtype V1, which would allow lazy, iterative access that does not load the whole stream into memory on read. The data are chunked in Zarr, and the fill value is a more reasonable AA==. That is probably the better option, but I am not sure whether there would be any unexpected performance impacts.
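To make the lazy-access point concrete, here is a sketch of the (N,) V1 variant with an explicit (arbitrary) chunk size and a partial read that only touches the first chunk:

>>> import numpy as np
>>> import zarr
>>> root = zarr.open('test_chunked.zarr', mode='w')
>>> z = root.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"), chunks=(1_000_000,))
>>> z.chunks
(1000000,)
>>> # read only the first 1000 bytes without loading the whole stream into memory
>>> root = zarr.open('test_chunked.zarr', mode='r')
>>> header = root["data"][:1000].tobytes()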

rly added this to the Future milestone on Aug 9, 2024
rly moved this from the Future milestone to the 3.0.0 milestone on Oct 3, 2024