Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when chunk coordinates are invalid #395

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

emacollins
Copy link

Closes #175

Copy link
Contributor

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good. Some comments added.

/// # Returns
///
/// An iterator over the maximum permitted chunk indices.
fn max_chunk_indices_permitted(&self) -> impl Iterator<Item = u64> + '_ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think chunk indices are u32:

pub struct ChunkIndices(pub Vec<u32>);

At the beginning of this method, it could be a good place to:

debug_assert_eq!(self.shape.len(), self.chunk_shape.0.len())

or similar.

Copy link
Author

@emacollins emacollins Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a new commit to address this. I ended up returning a ChunkIndices type from this using collect() on the iterator and type casting the calculation result to u32. Not sure if there is a performance consideration with using collect prior to returning, but it seemed to make sense that the type returned be ChunkIndices. This way the actual validation function compares two ChunkIndices.

Shape and chunk shape were both u64, and I assume chunk indices are u32 since they will always be much smaller values.

Also added the debug assert, which is causing tests in repository.rs to fail. I noticed this is because the chunk shape initialized is a single value, even when the shape is multiple. Do we assume that if a single chunk shape value is present, it applies to all dimensions? Or should this fail? Example of test data from one of the existing tests.

let zarr_meta = ZarrArrayMetadata {
            shape: vec![1, 1, 2],
            data_type: DataType::Int32,
            chunk_shape: ChunkShape(vec![NonZeroU64::new(2).unwrap()]),
            ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up returning a ChunkIndices

perfect!

I assume chunk indices are u32

yes, you cand valitade looking here

chunk shape initialized is a single value, even when the shape is multiple.

this is probably a bad test, we should fix it to match. We need to add more validations to the metadata to avoid this kind of thing, but it's not critical because zarr also validates.

Do we assume that if a single chunk shape value is present, it applies to all dimensions

No, I don't think so, probably just a bad test, please change it to match.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answers! I will return to this PR next week after the holiday.

self.shape
.iter()
.zip(self.chunk_shape.0.iter())
.map(|(s, cs)| ((s + cs.get() - 1)) / cs.get() - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The math looks a bit weird to me. Shouldn't it be just s / cs.get() - 1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I don't know why I left this expanded like this. Simplified it down to (s - 1) / cs.get(), which should be equivalent and also seems to handle all cases properly.

I added tests to confirm this, but if there are additional edge cases I haven't thought of, let me know. Will continue to think on it.

/// # Errors
///
/// Returns `IcechunkFormatError::ChunkCoordinatesNotFound` if the chunk coordinates are invalid.
pub fn valid_chunk_coord(&self, coord: &ChunkIndices) -> IcechunkResult<bool> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this function doesn't really fail. Things would be easier if you make it return a simple bool

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

I am not sure I am handling the false case in the most elegant way in set_chunk_ref.

Copy link
Contributor

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really great @emacollins

serde_json::Value::from(42),
)))),
}],
storage_transformers: Some(vec![StorageTransformer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can pass None here to make it shorter

serde_json::Value::from(42),
)))),
}]),
dimension_names: Some(vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

let coord2 = ChunkIndices(vec![10, 11, 10]);
let coord3 = ChunkIndices(vec![0, 0 ,0]);

assert_eq!(true, zarr_meta1.valid_chunk_coord(&coord1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace these by assert!(...) or assert!(! ...)

message: "getting an array".to_string(),
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add a test for Repository too? that validates that the error is correctly bubbling up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@emacollins emacollins marked this pull request as ready for review November 30, 2024 19:21
Copy link
Contributor

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emacollins Let's just change the error as shown in my comment and we are ready to merge. Thank you!

self.change_set.set_chunk_ref(node_snapshot.id, coord, data);
Ok(())
} else {
Err(RepositoryError::FormatError(IcechunkFormatError::ChunkCoordinatesNotFound { coords: coord }))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that i think about it, this is not really a FormatError but a user error. This deserves a new addition to RepositoryError, something like InvalidIndex or something like that.

@paraseba
Copy link
Contributor

paraseba commented Dec 2, 2024

@emacollins you may also want to rebase your PR so it has no conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't allow to write chunks to invalid chunk coordinates
2 participants