[WIP] Structured array for manifest #39

TomNicholas · 2024-03-17T22:28:36Z

Aims to close #33 by using a numpy array with a structured dtype to store the (path, offset, length) information for each chunk.

EDIT: Should add

broadcast_to
empty_like

TomNicholas · 2024-03-18T15:28:53Z

virtualizarr/manifests/manifest.py

+# TODO we want the path field to contain a variable-length string, but that's not available until numpy 2.0
+# See https://numpy.org/neps/nep-0055-string_dtype.html
+MANIFEST_STRUCTURED_ARRAY_DTYPES = np.dtype(
+    [("path", "<U32"), ("offset", np.int32), ("length", np.int32)]
+)


Because file paths can be strings of any length, we really need to be using numpy's new variable-width string dtype here.

Unfortunately it's only coming out with numpy 2.0, and although there is a release candidate for numpy 2.0, it's so new that pandas doesn't support it yet. Xarray has a pandas dependency, so currently we can't actually build an environment that lets us try virtualizarr with the variable-length string dtype.

Pandas just released 2.2.2, which is compatible with the upcoming numpy 2.0 release.

Not sure if that will break any part of xarray that we need for VirtualiZarr, but this might be close enough to test out variable-length dtypes now.
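To make the fixed-width limitation concrete, here is a minimal sketch using the same field layout as the diff above. The bucket and file names are made up for illustration; the point is only that a `<U32` field silently truncates any path longer than 32 characters.

```python
import numpy as np

# Same field layout as MANIFEST_STRUCTURED_ARRAY_DTYPES in the diff above.
manifest_dtype = np.dtype(
    [("path", "<U32"), ("offset", np.int32), ("length", np.int32)]
)

# Hypothetical path, deliberately longer than 32 characters.
long_path = "s3://hypothetical-bucket/some/deeply/nested/file.nc"

# Build a one-entry structured array; numpy raises no error for the long string.
entries = np.array([(long_path, 100, 50)], dtype=manifest_dtype)

# The stored path has been cut down to the first 32 characters.
stored_path = str(entries["path"][0])
was_truncated = stored_path != long_path
```

Picking a wider fixed width only moves the cutoff and pads every entry to that width, which is why a variable-length string dtype is the better fit here.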

TomNicholas · 2024-03-18T16:31:55Z

virtualizarr/manifests/manifest.py

-        raise ValueError("Chunk keys do not form a complete grid")
-
-
-def concat_manifests(manifests: List["ChunkManifest"], axis: int) -> "ChunkManifest":


Lines 188-263 are what we get rid of by doing concatenation/stacking via the wrapped structured array.
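As a rough sketch of why the removed code becomes unnecessary: once the entries live in one structured array, numpy's own `np.concatenate` and `np.stack` handle all three fields at once. The `make_entries` helper and the file names below are hypothetical and not part of the PR.

```python
import numpy as np

manifest_dtype = np.dtype(
    [("path", "<U32"), ("offset", np.int32), ("length", np.int32)]
)


def make_entries(path, n_chunks, length=10):
    """Hypothetical helper: n_chunks contiguous chunks of one file, as a 1D structured array."""
    return np.array(
        [(path, i * length, length) for i in range(n_chunks)],
        dtype=manifest_dtype,
    )


a = make_entries("a.nc", 2)  # shape (2,)
b = make_entries("b.nc", 3)  # shape (3,)
c = make_entries("c.nc", 2)  # shape (2,)

# Concatenate along the existing axis: the chunk grid grows from 2 to 5 entries.
concatenated = np.concatenate([a, b], axis=0)

# Stack along a new leading axis: two 1D grids of equal length become one 2D grid.
stacked = np.stack([a, c], axis=0)
```

No chunk-key parsing or re-keying is needed, because a chunk's position in the grid is simply its index in the array.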

TomNicholas · 2024-03-18T16:43:42Z

virtualizarr/manifests/manifest.py

@@ -14,6 +13,9 @@
 _CHUNK_KEY = rf"^{_INTEGER}+({_SEPARATOR}{_INTEGER})*$"  # matches 1 integer, optionally followed by more integers each separated by a separator (i.e. a period)


+ChunkDict = NewType("ChunkDict", dict[ChunkKey, dict[str, Union[str, int]]])
+
+
 class ChunkEntry(BaseModel):


Not sure we really need this class anymore

codecov · 2024-03-18T19:20:45Z

Codecov Report

Attention: Patch coverage is 96.73913%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 87.17%. Comparing base (f226093) to head (385290d).
Report is 15 commits behind head on main.

❗ Current head 385290d differs from pull request most recent head 3354843. Consider uploading reports for the commit 3354843 to get more accurate results

Files	Patch %	Lines
virtualizarr/manifests/manifest.py	91.66%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #39      +/-   ##
==========================================
- Coverage   90.18%   87.17%   -3.02%     
==========================================
  Files          14       13       -1     
  Lines         998      834     -164     
==========================================
- Hits          900      727     -173     
- Misses         98      107       +9

Flag	Coverage Δ
unittests	`87.17% <96.73%> (-3.02%)`	⬇️

TomNicholas · 2024-03-19T14:23:07Z

With this approach, array entries with empty strings for paths are interpreted as chunks missing from the zarr array, so we could easily solve #22 just by having an empty_like creation function that creates an array with only empty paths.
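A minimal sketch of what such a function might look like, assuming the structured dtype from the diff above. The name `empty_manifest_like` and the example entries are hypothetical; the sketch relies only on `np.zeros_like`, which fills string fields with empty strings and integer fields with zeros.

```python
import numpy as np

manifest_dtype = np.dtype(
    [("path", "<U32"), ("offset", np.int32), ("length", np.int32)]
)


def empty_manifest_like(entries):
    """Hypothetical helper: same shape and dtype as `entries`, with every path empty."""
    # A zeroed structured array has "" in its string field and 0 in its integer fields.
    return np.zeros_like(entries)


# A 2x2 grid of made-up chunk entries.
entries = np.array(
    [[("a.nc", 0, 10), ("a.nc", 10, 10)], [("b.nc", 0, 10), ("b.nc", 10, 10)]],
    dtype=manifest_dtype,
)

empty = empty_manifest_like(entries)

# Under the convention described above, an empty path marks a missing chunk.
missing = empty["path"] == ""
```

The same `path == ""` mask could be used on any manifest array to find which chunks are absent.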

TomNicholas · 2024-05-10T16:55:47Z

Superseded by #107

TomNicholas added 3 commits March 17, 2024 16:13

change entries property to a structured array, add from_dict

0c445fd

fix validation

3bc483f

equals method

20f2ded

TomNicholas commented Mar 18, 2024

re-implemented concatenation through concatenation of the wrapped structured array

be8af12

TomNicholas commented Mar 18, 2024

TomNicholas added 2 commits March 18, 2024 12:50

fixed manifest.from_kerchunk_dict

bd8ad22

fixed kerchunk tests

385290d

TomNicholas added the enhancement New feature or request label Apr 3, 2024

This was referenced Apr 4, 2024

Writing to parquet (following kerchunk format) #72

Closed

Inline loaded variables into kerchunk references #73

Merged

TomNicholas added 2 commits April 11, 2024 15:08

Merge branch 'main' into structured_array_manifest

309019a

Merge branch 'main' into structured_array_manifest

4132b32

This was referenced May 3, 2024

Manifest storage transformer zarr-developers/zarr-specs#287

Open

Performance roadmap #104

Open

In-memory representation of chunks: array instead of a dict? #33

Closed

TomNicholas and others added 2 commits May 9, 2024 16:10

Merge branch 'main' into structured_array_manifest

830dccc

[pre-commit.ci] auto fixes from pre-commit.com hooks

3354843

for more information, see https://pre-commit.ci

TomNicholas mentioned this pull request May 10, 2024

Use 3 numpy arrays for manifest internally #107

Merged

TomNicholas closed this May 10, 2024

TomNicholas deleted the structured_array_manifest branch May 16, 2024 03:36
