I created some Iceberg tables with the latest Iceberg/Spark and checked whether the schemas of the generated Manifest Files and Manifest Lists conform to the Spec.
This issue is also based on the discussions in this PR and this Slack thread, where we discussed what the semantics should be when the Spec labels a field optional:
So far, the outcome of the discussion is that if a field is labeled optional for an Iceberg format version, then writers writing that format version should include a column for that field in the Avro file but mark that column optional (i.e., nullable, which in Avro is the union ["null", T]). They should not just leave out the column.
This issue lists all deviations from the Spec I could find. All of them were found in Iceberg tables created with Spark 3.4.1 using Iceberg 1.3.1 (latest release).
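The convention described above (write the column, but make it nullable) can be sketched with plain Avro field definitions. This is only an illustration: the helper names `optional_field`, `required_field`, and `is_optional` are hypothetical and not part of Iceberg, and the field types are the ones I assume from the Spec.

```python
import json


def optional_field(name, avro_type):
    """Hypothetical helper: an optional field is written as a ["null", T] union."""
    return {"name": name, "type": ["null", avro_type], "default": None}


def required_field(name, avro_type):
    """Hypothetical helper: a required field is written with the bare type T."""
    return {"name": name, "type": avro_type}


def is_optional(field):
    """A field counts as optional if its type is a union that contains "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


# key_metadata is optional, so the column is present but nullable.
key_metadata = optional_field("key_metadata", "bytes")
# added_snapshot_id is required, so its type is just "long".
added_snapshot_id = required_field("added_snapshot_id", "long")

print(json.dumps(key_metadata))
print(json.dumps(added_snapshot_id))
```

Under this reading, leaving `key_metadata` out of the schema entirely is a deviation, even though every value in the column may be null.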
In a format_version=2 Manifest List:
Column key_metadata is not written at all, even though the Spec tags it as optional in v2 (and v1).
The fields added_files_count, existing_files_count, deleted_files_count are not named correctly. They have an additional data_ infix. This was already reported separately in Issue 8684, but I include it here as well for the sake of completeness.
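A check for these two kinds of deviation (a column that is missing, or a column written with an extra data_ infix) could look roughly like the sketch below. The function `find_deviations` and the sample list of written names are hypothetical. In practice the written names would come from the Avro schema embedded in the Manifest List file.

```python
# Field names as the Spec lists them (only the ones discussed in this issue).
EXPECTED_NAMES = [
    "added_files_count",
    "existing_files_count",
    "deleted_files_count",
    "key_metadata",
]


def find_deviations(written_names, expected_names):
    """Report each expected name that is missing or was written with a data_ infix."""
    written = set(written_names)
    report = {}
    for name in expected_names:
        if name in written:
            continue
        # Build the infixed variant, e.g. added_files_count -> added_data_files_count.
        prefix, _, rest = name.partition("_")
        infixed = f"{prefix}_data_{rest}"
        if infixed in written:
            report[name] = f"renamed to {infixed}"
        else:
            report[name] = "missing"
    return report


# Hypothetical subset of the field names found in a written Manifest List.
written_names = [
    "manifest_path",
    "added_data_files_count",
    "existing_data_files_count",
    "deleted_data_files_count",
]

for name, problem in find_deviations(written_names, EXPECTED_NAMES).items():
    print(f"{name}: {problem}")
```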
In a format_version=1 Manifest File:
Column file_ordinal is not written at all, even though the Spec tags it as optional in v1 (it is considered deprecated, though).
Column sort_columns is not written at all, even though the Spec tags it as optional in v1 (it is considered deprecated, though).
[same as v2] Column distinct_counts is not written at all, even though the Spec tags it as optional in v1 (and v2).
In a format_version=1 Manifest List:
The position of the column 507 partitions in the manifest_file struct differs from its position in the Spec:
In the Spec, it is placed after 514 deleted_rows_count.
In the file, it is placed after 506 deleted_files_count.
The field added_snapshot_id has type ["null","long"], but the spec says it is required, so the type should just be "long".
[same as v2] Column key_metadata is not written at all, even though the Spec tags it as optional in v2 (and v1).
[same as v2] The fields added_files_count, existing_files_count, deleted_files_count are not named correctly. They have an additional data_ infix. This was already reported separately in Issue 8684, but I include it here as well for the sake of completeness.
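The ordering and nullability deviations above can be checked in a similar way. The sketch below compares positions by field ID rather than by name, so the data_ infix does not get in the way. The functions `directly_follows` and `is_nullable` and the `written_schema` sample are hypothetical, and the sample is heavily simplified: only four fields are shown, and the type of partitions is a placeholder.

```python
def directly_follows(schema, field_id, predecessor_id):
    """True if the field with field_id comes immediately after predecessor_id."""
    ids = [field.get("field-id") for field in schema["fields"]]
    if field_id not in ids or predecessor_id not in ids:
        return False
    return ids.index(field_id) == ids.index(predecessor_id) + 1


def is_nullable(schema, name):
    """True if the named field is written as a union that contains "null"."""
    for field in schema["fields"]:
        if field["name"] == name:
            field_type = field["type"]
            return isinstance(field_type, list) and "null" in field_type
    raise KeyError(name)


# Hypothetical, simplified excerpt of a written v1 Manifest List schema.
written_schema = {
    "type": "record",
    "name": "manifest_file",
    "fields": [
        {"name": "added_snapshot_id", "type": ["null", "long"]},
        {"name": "deleted_data_files_count", "type": ["null", "int"], "field-id": 506},
        # Placeholder type; the real column holds partition field summaries.
        {"name": "partitions", "type": ["null", "string"], "field-id": 507},
        {"name": "deleted_rows_count", "type": ["null", "long"], "field-id": 514},
    ],
}

# The Spec places 507 after 514, but the file places it after 506.
print(directly_follows(written_schema, 507, 514))
print(directly_follows(written_schema, 507, 506))
# The Spec says added_snapshot_id is required, so this should not be nullable.
print(is_nullable(written_schema, "added_snapshot_id"))
```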
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Apache Iceberg version: 1.3.1 (latest release)
Query engine: Spark