Error when chunk coordinates are invalid #395

Open · wants to merge 8 commits into main
51 changes: 50 additions & 1 deletion icechunk/src/format/snapshot.rs
@@ -16,7 +16,7 @@ use crate::metadata::{
use super::{
format_constants, manifest::ManifestRef, AttributesId, IcechunkFormatError,
IcechunkFormatVersion, IcechunkResult, ManifestId, NodeId, ObjectId, Path,
SnapshotId, TableOffset,
SnapshotId, TableOffset, ChunkIndices
};

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
@@ -49,6 +49,55 @@ pub struct ZarrArrayMetadata {
pub dimension_names: Option<DimensionNames>,
}

impl ZarrArrayMetadata {

/// Returns an iterator over the maximum permitted chunk indices for the array.
///
/// This function calculates the maximum chunk indices based on the shape of the array
/// and the chunk shape.
///
/// # Returns
///
/// An iterator over the maximum permitted chunk indices.
fn max_chunk_indices_permitted(&self) -> impl Iterator<Item = u64> + '_ {

Contributor:
I think chunk indices are u32:

pub struct ChunkIndices(pub Vec<u32>);

At the beginning of this method, it could be a good place to:

debug_assert_eq!(self.shape.len(), self.chunk_shape.0.len())

or similar.

Author (@emacollins), Nov 20, 2024:

Pushed a new commit to address this. I ended up returning a ChunkIndices from this method by calling collect() on the iterator and casting each calculated value to u32 (see the sketch below). I'm not sure whether there is a performance cost to collecting before returning, but it seemed to make sense for the return type to be ChunkIndices, so that the actual validation function compares two ChunkIndices values.

Shape and chunk shape were both u64, and I assume chunk indices are u32 since they will always be much smaller values.

Also added the debug assert, which is causing tests in repository.rs to fail. I noticed this is because the chunk shape is initialized with a single value even when the shape has multiple dimensions. Do we assume that if a single chunk shape value is present, it applies to all dimensions? Or should this fail? Here is an example of test data from one of the existing tests:

let zarr_meta = ZarrArrayMetadata {
            shape: vec![1, 1, 2],
            data_type: DataType::Int32,
            chunk_shape: ChunkShape(vec![NonZeroU64::new(2).unwrap()]),
            ...
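
For reference, a minimal sketch of what the revised method could look like after these changes; this is one reading of the discussion, not the exact committed code, and it assumes shape: Vec<u64> and chunk_shape: ChunkShape(Vec<NonZeroU64>) as defined in this file:

impl ZarrArrayMetadata {
    /// Sketch: maximum permitted chunk index per dimension, returned as ChunkIndices.
    fn max_chunk_indices_permitted(&self) -> ChunkIndices {
        // Shape and chunk shape must have the same number of dimensions.
        debug_assert_eq!(self.shape.len(), self.chunk_shape.0.len());
        ChunkIndices(
            self.shape
                .iter()
                .zip(self.chunk_shape.0.iter())
                // ceil(shape / chunk_shape) - 1 = index of the last chunk along this dimension
                // (assumes every dimension is non-empty, i.e. s >= 1).
                .map(|(s, cs)| ((s + cs.get() - 1) / cs.get() - 1) as u32)
                .collect(),
        )
    }
}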

Contributor:

> I ended up returning a ChunkIndices

Perfect!

> I assume chunk indices are u32

Yes, you can validate that by looking here.

> chunk shape initialized is a single value, even when the shape is multiple

This is probably a bad test; we should fix it to match. We need to add more validations to the metadata to avoid this kind of thing, but it's not critical because zarr also validates.

> Do we assume that if a single chunk shape value is present, it applies to all dimensions?

No, I don't think so; it's probably just a bad test. Please change it to match.
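
For example, the test metadata quoted earlier could be fixed so that chunk_shape has one NonZeroU64 per dimension; the values below are just one plausible choice, with the remaining fields elided as in the original snippet:

let zarr_meta = ZarrArrayMetadata {
    shape: vec![1, 1, 2],
    data_type: DataType::Int32,
    chunk_shape: ChunkShape(vec![
        NonZeroU64::new(1).unwrap(),
        NonZeroU64::new(1).unwrap(),
        NonZeroU64::new(2).unwrap(),
    ]),
    // ...remaining fields as in the existing test
};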

Author (@emacollins):

Thanks for answers! I will return to this PR next week after the holiday.

self.shape
.iter()
.zip(self.chunk_shape.0.iter())
.map(|(s, cs)| (s + cs.get() - 1) / cs.get() - 1)

Contributor:

The math looks a bit weird to me. Shouldn't it be just s / cs.get() - 1?

Author (@emacollins):

Yes, I'm not sure why I left this expanded like that. I simplified it down to (s - 1) / cs.get(), which should be equivalent and also seems to handle all cases properly.

I added tests to confirm this, but if there are additional edge cases I haven't thought of, let me know. I'll continue to think on it.
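
As a quick sanity check that the expanded formula and the simplified one agree (both assume s >= 1, since an empty dimension would underflow either way):

fn main() {
    // (s + cs - 1) / cs - 1  ==  (s - 1) / cs  for s >= 1
    for (s, cs) in [(1u64, 1u64), (1, 2), (6, 2), (7, 2), (10, 3)] {
        assert_eq!((s + cs - 1) / cs - 1, (s - 1) / cs);
    }
}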

}

/// Validates the provided chunk coordinates for the array.
///
/// This function checks if the provided chunk indices are valid for the array.
///
/// # Arguments
///
/// * `coord` - The chunk indices to validate.
///
/// # Returns
///
/// An `IcechunkResult` indicating whether the chunk coordinates are valid.
///
/// # Errors
///
/// Returns `IcechunkFormatError::ChunkCoordinatesNotFound` if the chunk coordinates are invalid.
pub fn valid_chunk_coord(&self, coord: &ChunkIndices) -> IcechunkResult<bool> {

Contributor:

Actually, this function doesn't really fail. Things would be easier if you made it return a simple bool.

Author (@emacollins):

Done!

I am not sure I am handling the false case in the most elegant way in set_chunk_ref.
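
A rough sketch of the bool-returning shape discussed here, assuming max_chunk_indices_permitted returns a ChunkIndices as described earlier in the thread (not the exact committed code):

pub fn valid_chunk_coord(&self, coord: &ChunkIndices) -> bool {
    // Reject coordinates with the wrong dimensionality or any index past the last chunk.
    let max = self.max_chunk_indices_permitted();
    coord.0.len() == max.0.len()
        && coord.0.iter().zip(max.0.iter()).all(|(index, max_index)| index <= max_index)
}

With this shape, set_chunk_ref has to construct the error itself when the check returns false, which is the "false case" handling mentioned above.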


let valid: bool = coord
.0
.iter()
.zip(self.max_chunk_indices_permitted())
.all(|(index, index_permitted)| *index <= index_permitted as u32);

if valid {
Ok(true)
} else {
Err(IcechunkFormatError::ChunkCoordinatesNotFound { coords: coord.clone() })
}

}
}

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum NodeData {
Array(ZarrArrayMetadata, Vec<ManifestRef>),
17 changes: 14 additions & 3 deletions icechunk/src/repository.rs
@@ -424,9 +424,20 @@ impl Repository {
coord: ChunkIndices,
data: Option<ChunkPayload>,
) -> RepositoryResult<()> {
self.get_array(&path)
.await
.map(|node| self.change_set.set_chunk_ref(node.id, coord, data))

let node_snapshot = self.get_array(&path).await?;

if let NodeData::Array(zarr_metadata, _) = node_snapshot.node_data {
zarr_metadata.valid_chunk_coord(&coord)?;
self.change_set.set_chunk_ref(node_snapshot.id, coord, data);
Ok(())
} else {
Err(RepositoryError::NotAnArray {
node: node_snapshot,
message: "getting an array".to_string(),
})
}

Contributor:

Maybe we can add a test for Repository too, one that validates that the error bubbles up correctly?

Author (@emacollins):

Done!
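
A hypothetical shape for such a test; create_test_repository, array_path, and basic_array_metadata are placeholders standing in for whatever helpers the real test suite in repository.rs uses, and add_array is assumed from context:

#[tokio::test]
async fn test_set_chunk_ref_rejects_invalid_coords() {
    // Placeholders: substitute the repo's real test helpers and metadata here.
    let mut repo = create_test_repository().await;
    let path = array_path("/array");
    // Assume metadata with shape [1, 1, 2] and chunk shape [1, 1, 1].
    repo.add_array(path.clone(), basic_array_metadata()).await.unwrap();

    // An index well past the last chunk should surface the new error.
    let result = repo.set_chunk_ref(path, ChunkIndices(vec![0, 0, 99]), None).await;
    assert!(result.is_err());
}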

}

pub async fn get_node(&self, path: &Path) -> RepositoryResult<NodeSnapshot> {