Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSStore dimension separator bug workaround #43

Closed
wants to merge 0 commits into from

Conversation

sanath-2024
Copy link
Contributor

Patch to Avoid FSStore Bug

The Bug

Due to "overly aggressive canonicalization" in Zarr's FSStore implementation, all periods ('.') in a logical storage path are replaced with slashes ('/') when the dimension separator is a slash. One major consequence of this bug is that the .zmetadata file is implicitly renamed to /zmetadata while writing and searching. This has no observable effects on the local filesystem on Linux, since the filesystem concatenates double slashes into a single slash. However, when the object is stored on a cloud object store (e.g. GCS), the double slash is not ignored. Additionally, even on the local filesystem, the concatenation of double slashes is technically an implementation-defined behavior and is therefore not safe. So, it makes sense to avoid the explicit or implicit use of double slashes.

The Fix

The simplest way to avoid this bug is to avoid using '.' or '/' characters in the path to the metadata file. This change will not affect the ability to read existing MDIO files on the local FS, since the double slash was eliminated by the filesystem anyway. However, it will allow existing MDIO files to be properly read from object stores as well.

@tasansal tasansal changed the title avoid FSStore bug FSStore dimension separtor bug workaround Sep 3, 2022
@tasansal
Copy link
Collaborator

tasansal commented Sep 3, 2022

Thanks @sanath-2024

This should fix the problem I think but I need to run some manual tests since we don't have automated mock object store tests yet.

There is also another potential solution that may be worth looking into, and would be cleaner if it works:

  1. Leave FSStore as is (no separator parameterization)
  2. Only apply the slash separator to the arrays

This way we don't break the self describing nature of Zarr.

I've also mentioned this issue here:

zarr-developers/zarr-python#720 (comment)

@sanath-2024
Copy link
Contributor Author

I think your solution is cleaner. Note that if the FSStore doesn't have a dimension separator it will default back to ".", making the filename ".zmetadata" - on the one hand this is cleaner and it is definitely what's supposed to happen; on the other hand it breaks the existing files that are called "zmetadata" or "/zmetadata".

@srib srib changed the title FSStore dimension separtor bug workaround FSStore dimension separator bug workaround Sep 3, 2022
@tasansal
Copy link
Collaborator

tasansal commented Sep 4, 2022

Created this issue with Zarr:

zarr-developers/zarr-python#1121

In the meantime, @sanath-2024; this is what needs to be done:
(I already tested this works on toy data, but will test again after merge on real data)

  1. Revert your changes
  2. Remove every usage of dimension_separator
  3. Use dimension_separator ONLY where we create arrays

These are the usages I could find, please take a deeper look.

Let me know if you don't have time to work on this, I can make changes and commit here too.

Please don't forget to run pre-commit on your code before you submit, I noticed the pre-commit action is failing.

In tests:

zarr_root["metadata"].create_dataset(
data=grid.live_mask,
name="live_mask",
shape=grid.shape[:-1],
chunks=-1,
)

data_arr = data_grp.create_dataset("chunked_012", data=mock_data)

metadata_grp.create_dataset(
data=il_grid * xl_grid,
name="_".join(["chunked_012", "trace_headers"]),
shape=grid.shape[:-1], # Same spatial shape as data
chunks=data_arr.chunks[:-1], # Same spatial chunks as data
)

In SEG-Y conversion:

trace_array = data_root.create_dataset(
name=name,
shape=grid.shape,
compressor=trace_compressor,
chunks=chunks,
**kwargs,
)

header_array = metadata_root.create_dataset(
name="_".join([name, "trace_headers"]),
shape=grid.shape[:-1], # Same spatial shape as data
chunks=chunks[:-1], # Same spatial chunks as data
compressor=header_compressor,
dtype=header_dtype,
)

zarr_root["metadata"].create_dataset(
data=grid.live_mask,
name="live_mask",
shape=grid.shape[:-1],
chunks=-1,
)

@tasansal tasansal self-requested a review September 4, 2022 16:13
@tasansal tasansal added the bug Something isn't working label Sep 4, 2022
@tasansal
Copy link
Collaborator

tasansal commented Sep 4, 2022

closes #42

This PR will eliminate and backwards compatibility issues until Zarr v3 or we make major changes

@sanath-2024
Copy link
Contributor Author

I can make those changes, and I'll make sure to run pre-commit next time. I will keep you updated about my progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants