-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Structured array for manifest #39
Conversation
# TODO we want the path field to contain a variable-length string, but that's not available until numpy 2.0 | ||
# See https://numpy.org/neps/nep-0055-string_dtype.html | ||
MANIFEST_STRUCTURED_ARRAY_DTYPES = np.dtype( | ||
[("path", "<U32"), ("offset", np.int32), ("length", np.int32)] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because file paths can be strings of any length, we really need to be using numpy's new variable-width string dtype here.
Unfortunately it's only coming out with numpy 2.0, and although there is a release candidate for numpy 2.0, it's so new that pandas doesn't support it yet. Xarray has a pandas dependency, so currently we can't actually build an environment that let's us try virtualizarr with the variable-length string dtype yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas just released 2.2.2, which is compatible with the upcoming numpy 2.0 release.
Not sure if that will break any part of xarray that we need for VirtualiZarr, but this might now be close enough to test out variable-length dtypes now.
raise ValueError("Chunk keys do not form a complete grid") | ||
|
||
|
||
def concat_manifests(manifests: List["ChunkManifest"], axis: int) -> "ChunkManifest": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 188-263 are what we get rid of by doing concatenation/stacking via the wrapped structured array.
@@ -14,6 +13,9 @@ | |||
_CHUNK_KEY = rf"^{_INTEGER}+({_SEPARATOR}{_INTEGER})*$" # matches 1 integer, optionally followed by more integers each separated by a separator (i.e. a period) | |||
|
|||
|
|||
ChunkDict = NewType("ChunkDict", dict[ChunkKey, dict[str, Union[str, int]]]) | |||
|
|||
|
|||
class ChunkEntry(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we really need this class anymore
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #39 +/- ##
==========================================
- Coverage 90.18% 87.17% -3.02%
==========================================
Files 14 13 -1
Lines 998 834 -164
==========================================
- Hits 900 727 -173
- Misses 98 107 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
As with this approach array entries with empty strings for paths are interpreted as chunks missing from the zarr array, we could easily solve #22 just by having an |
Superceded by #107 |
Aims to close #33 by using a numpy array with a structured dtype to store the
(path, offset, length)
information for each chunk.EDIT: Should add
broadcast_to
empty_like