
Don't allow to write chunks to invalid chunk coordinates #175

Open
paraseba opened this issue Oct 10, 2024 · 10 comments · May be fixed by #395
Labels
good first issue 🐣 Good for newcomers

Comments

@paraseba
Contributor

No description provided.

@paraseba
Contributor Author

@jhamman do you think implementing this could cause any kind of problem with zarr-python? For example, this would demand that an array's metadata is resized before its new chunks are set.

@paraseba paraseba added the good first issue 🐣 Good for newcomers label Nov 5, 2024
@emacollins

@paraseba I'd like to give this one a shot! Let me know any context or background I should have.

@paraseba
Contributor Author

paraseba commented Nov 6, 2024

@emacollins Great!

So, the key to this one is in the set_chunk_ref method of the Repository impl. In there we get the Array and apply the requested change. After getting the array, but before applying the change, we would need to check that the passed coord: ChunkIndices is valid for the array. If it is not, we should fail.

To know if the coordinates are valid, we need to look into the array metadata, which you can find in the response of Repository.get_array.

You'll probably need to add a new error case to RepositoryError.

Probably some of our tests fail after this change, and also, we'll want to add tests that verify the new behavior.

Let me know if that's enough to get you started or if you want more context. And ping me with any questions.
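To make the shape of the change concrete, here's a rough sketch of the flow, using hypothetical simplified stand-in types rather than the real icechunk Repository, ChunkIndices, and RepositoryError (the actual signatures differ):

```rust
// Hypothetical, simplified stand-ins for icechunk types, for illustration only.
#[derive(Debug, PartialEq)]
enum RepositoryError {
    // the new error case suggested above
    InvalidChunkIndex(Vec<u32>),
}

struct ArrayMeta {
    shape: Vec<u64>,       // array size per dimension
    chunk_shape: Vec<u64>, // chunk size per dimension
}

// Validate the coordinates against the chunk grid before applying the change.
fn set_chunk_ref(meta: &ArrayMeta, coords: &[u32]) -> Result<(), RepositoryError> {
    let in_bounds = coords.len() == meta.shape.len()
        && meta
            .shape
            .iter()
            .zip(&meta.chunk_shape)
            .map(|(s, c)| (s + c - 1) / c) // chunks per dimension (ceiling division)
            .zip(coords)
            .all(|(n, &i)| (i as u64) < n);
    if !in_bounds {
        return Err(RepositoryError::InvalidChunkIndex(coords.to_vec()));
    }
    // ...apply the chunk reference change here...
    Ok(())
}
```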

Thank you!

@emacollins

@paraseba Thanks for the detailed explanation! I think this is enough to get started. I'll familiarize myself with each of these parts and see what I can do.

@paraseba
Contributor Author

paraseba commented Nov 7, 2024

Sounds great @emacollins! Take your time and let me know if you have questions.

@paraseba
Contributor Author

paraseba commented Nov 8, 2024

Another change we would need, but it could be done in a separate ticket, would be to make sure that if the user shrinks an array by changing its metadata, all the chunks that are outside of the array bounds disappear from the manifest.
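The pruning could be sketched roughly like this. A HashMap keyed by chunk coordinates stands in for the real manifest structure, which differs in icechunk; the idea is just to retain only chunks that still fall inside the new chunk grid:

```rust
use std::collections::HashMap;

// Simplified stand-in for a manifest mapping chunk coordinates to chunk
// references; the real icechunk manifest structure is different.
type Manifest = HashMap<Vec<u32>, String>;

/// Drop every chunk whose coordinates fall outside the chunk grid implied
/// by the new (shrunken) array shape.
fn prune_out_of_bounds(manifest: &mut Manifest, new_shape: &[u64], chunk_shape: &[u64]) {
    manifest.retain(|coords, _| {
        new_shape
            .iter()
            .zip(chunk_shape)
            .map(|(s, c)| (s + c - 1) / c) // chunks per dimension after the resize
            .zip(coords)
            .all(|(n, &i)| (i as u64) < n)
    });
}
```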

@emacollins

@paraseba

Alright, it doesn't look like I am able to publish my branch for this (permissions error?), but just want to check in and make sure I am on the right track. My thinking was to add a method to ZarrArrayMetadata that checks if a given set of coordinates is valid for the array. This could then be called in set_chunk_ref to perform the check before writing. Let me know if this makes sense or if I am completely off track here. Still getting up to speed on Zarr as well as Icechunk.

pub struct ZarrArrayMetadata {
    pub shape: ArrayShape,
    pub data_type: DataType,
    pub chunk_shape: ChunkShape,
    pub chunk_key_encoding: ChunkKeyEncoding,
    pub fill_value: FillValue,
    pub codecs: Vec<Codec>,
    pub storage_transformers: Option<Vec<StorageTransformer>>,
    pub dimension_names: Option<DimensionNames>,
}

impl ZarrArrayMetadata {
    pub fn chunk_indicies_valid(&self, chunk_indices: &ChunkIndices) -> bool {
        // Check if the provided chunk indices are valid for the array.
        // For example, given an array with shape (10000, 10000)
        // and chunk shape (1000, 1000) there will be 100 chunks
        // laid out in a 10 by 10 grid. The chunk with indices (0, 0)
        // provides data for rows 0-999 and columns 0-999 and is stored
        // under the key "0.0"; the chunk with indices (2, 4) provides data
        // for rows 2000-2999 and columns 4000-4999 and is stored under the
        // key "2.4"; etc.

        assert_eq!(self.shape.len(), self.chunk_shape.len());

        let mut chunk_size: u64;
        let mut num_chunks: u64;
        let mut max_chunk_index: u64;
        let mut requested_chunk_index: &u32;
        for i in 0..self.shape.len() {
            // NonZeroU64 needs .get() before arithmetic can be done on it
            chunk_size = self.chunk_shape.get(i).unwrap().get();
            // ceiling division: number of chunks along this dimension
            num_chunks = (self.shape[i] + chunk_size - 1) / chunk_size;
            max_chunk_index = num_chunks - 1;
            requested_chunk_index = chunk_indices.get(i).unwrap();
            // >= so the last chunk (index == max_chunk_index) is accepted
            if max_chunk_index >= *requested_chunk_index as u64 {
                continue;
            } else {
                return false;
            }
        }
        return true;
    }
}

@paraseba
Contributor Author

Great work @emacollins. My notes:

Alright, it doesn't look like I am able to publish my branch for this (permissions error?)

You need to fork the repository, push a new branch there, and then create a pull request once it's ready. I can also review the branch on your fork before you create a PR.

add a method to ZarrArrayMetadata that checks if a given set of coordinates is valid for the array.

Great

Let me know if this makes sense

Yes, very much so. That sounds like a fine approach.

A few more notes:

  • I'd probably name the method something like valid_chunk_coordinates
  • Rust has a mechanism for documentation comments; search for the /// syntax and you'll find the whole convention.
  • You could use debug_assert_eq! instead. Maybe we should add this validation when metadata is written and updated too, so that we fail early.
  • With a quick look your code seems to do the right thing, it's just not idiomatic (as expected as you learn the language). In Rust we usually try to avoid mutable variables, unless they are really necessary. We also try to avoid explicit index based looping. Take a look at the Iterator trait, it has a bunch of very useful methods that get used all the time for this kind of thing.
  • Here are some more tips:
    • You could write a helper function that returns an iterator to the maximum chunk index permitted (or even a Vec would work, but an Iterator may be better practice)
    • Then you can zip the results of that function with the request.
    • Then you verify that all elements in the zip result are valid.
    • Relevant functions: zip, all
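Putting those tips together, an idiomatic version might look roughly like this. The struct below is a simplified stand-in (only shape and chunk_shape, with NonZeroU64 assumed for the chunk-shape elements); the real icechunk types may differ:

```rust
use std::num::NonZeroU64;

// Simplified stand-in for the real metadata struct, for illustration only.
struct ZarrArrayMetadata {
    shape: Vec<u64>,
    chunk_shape: Vec<NonZeroU64>,
}

impl ZarrArrayMetadata {
    /// Returns true if `chunk_indices` addresses a chunk inside the
    /// array's chunk grid: one index per dimension, each strictly less
    /// than the number of chunks along that dimension.
    pub fn valid_chunk_coordinates(&self, chunk_indices: &[u32]) -> bool {
        debug_assert_eq!(self.shape.len(), self.chunk_shape.len());
        chunk_indices.len() == self.shape.len()
            && self
                .shape
                .iter()
                .zip(&self.chunk_shape)
                // ceiling division: number of chunks along each dimension
                .map(|(size, chunk)| (size + chunk.get() - 1) / chunk.get())
                .zip(chunk_indices)
                .all(|(num_chunks, &index)| (index as u64) < num_chunks)
    }
}
```

No mutable variables, no index-based loop: the two zips pair each dimension's chunk count with the requested index, and all short-circuits on the first out-of-bounds dimension.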

@emacollins

@paraseba Great, thank you very much for all the detail! I really wanted to make sure I got the core logic of what was needed down first. Now time to clean up and make sure this is implemented idiomatically :)

@emacollins emacollins linked a pull request Nov 15, 2024 that will close this issue
@emacollins

Created a draft pr with my progress. Please tear it apart (when you have the time, no rush!) to help me learn. @paraseba
