Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Structured array for manifest #39

Closed
wants to merge 10 commits into from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Mar 17, 2024

Aims to close #33 by using a numpy array with a structured dtype to store the (path, offset, length) information for each chunk.

EDIT: Should add

  • broadcast_to
  • empty_like

Comment on lines +49 to +53
# TODO we want the path field to contain a variable-length string, but that's not available until numpy 2.0
# See https://numpy.org/neps/nep-0055-string_dtype.html
MANIFEST_STRUCTURED_ARRAY_DTYPES = np.dtype(
[("path", "<U32"), ("offset", np.int32), ("length", np.int32)]
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because file paths can be strings of any length, we really need to be using numpy's new variable-width string dtype here.

Unfortunately it's only coming out with numpy 2.0, and although there is a release candidate for numpy 2.0, it's so new that pandas doesn't support it yet. Xarray has a pandas dependency, so currently we can't actually build an environment that let's us try virtualizarr with the variable-length string dtype yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas just released 2.2.2, which is compatible with the upcoming numpy 2.0 release.

Not sure if that will break any part of xarray that we need for VirtualiZarr, but this might now be close enough to test out variable-length dtypes now.

raise ValueError("Chunk keys do not form a complete grid")


def concat_manifests(manifests: List["ChunkManifest"], axis: int) -> "ChunkManifest":
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 188-263 are what we get rid of by doing concatenation/stacking via the wrapped structured array.

@@ -14,6 +13,9 @@
_CHUNK_KEY = rf"^{_INTEGER}+({_SEPARATOR}{_INTEGER})*$" # matches 1 integer, optionally followed by more integers each separated by a separator (i.e. a period)


ChunkDict = NewType("ChunkDict", dict[ChunkKey, dict[str, Union[str, int]]])


class ChunkEntry(BaseModel):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we really need this class anymore

Copy link

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 96.73913% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 87.17%. Comparing base (f226093) to head (385290d).
Report is 15 commits behind head on main.

❗ Current head 385290d differs from pull request most recent head 3354843. Consider uploading reports for the commit 3354843 to get more accurate results

Files Patch % Lines
virtualizarr/manifests/manifest.py 91.66% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #39      +/-   ##
==========================================
- Coverage   90.18%   87.17%   -3.02%     
==========================================
  Files          14       13       -1     
  Lines         998      834     -164     
==========================================
- Hits          900      727     -173     
- Misses         98      107       +9     
Flag Coverage Δ
unittests 87.17% <96.73%> (-3.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TomNicholas
Copy link
Member Author

As with this approach array entries with empty strings for paths are interpreted as chunks missing from the zarr array, we could easily solve #22 just by having an empty_like creation function that creates an array with only empty paths.

@TomNicholas
Copy link
Member Author

Superceded by #107

@TomNicholas TomNicholas deleted the structured_array_manifest branch May 16, 2024 03:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In-memory representation of chunks: array instead of a dict?
1 participant