Spec: Document Snapshot Summary Optional Fields for Standardization #11660
base: main
format/spec.md
Outdated
* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.

##### Optional Metrics
All metrics fields should have numeric string values (e.g., `"120"`).
I think we should note that engines may use these values for optimizations, so they must be correct, but they can be skipped if an engine doesn't want to write them. Maybe that's already clear enough with "optional", though.
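To make the "numeric string" and "optional" points concrete, here is a minimal Python sketch of how a reader might consume these fields. The helper `read_metric` and the sample summary map are hypothetical and are not part of the spec or of any Iceberg library; only the field names come from the table under review.

```python
from typing import Mapping, Optional


def read_metric(summary: Mapping[str, str], field: str) -> Optional[int]:
    """Return the metric as an int, or None if the writer skipped it."""
    raw = summary.get(field)
    if raw is None:
        return None  # optional field: absence is not an error
    if not (raw.isascii() and raw.isdigit()):
        # A present value has to be trustworthy, so reject anything malformed.
        raise ValueError(f"{field} must be a numeric string, got {raw!r}")
    return int(raw)


# Hypothetical summary map; values are strings, as the proposed text requires.
summary = {"operation": "append", "added-data-files": "120"}

added = read_metric(summary, "added-data-files")
total = read_metric(summary, "total-data-files")  # not written by this engine
```

In this sketch a missing field yields `None` so that an engine can fall back to computing the value itself, while a malformed value raises instead of being silently ignored.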
format/spec.md
Outdated
* `overwrite` -- Data and delete files were added and removed in a logical overwrite operation.
* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.

##### Optional Metrics
As on the proposal doc, I think this would make a nice appendix table.
format/spec.md
Outdated
Some of them are also used to represent partition-level metrics, in [Optional Partition-Level Summary](#optional-partition-level-summary).

| Field | Description | Used in Partition-Level Summary |
|-------------------------------------|-------------------------------------------------------------------|---------------------------------|
I'm not sure "current" is required here
format/spec.md
Outdated
|-------------------------------------|-------------------------------------------------------------------|---------------------------------|
| **`added-data-files`** | Number of data files added in the current snapshot | Yes |
| **`deleted-data-files`** | Number of data files deleted in the current snapshot | Yes |
| **`total-data-files`** | Total number of data files in the current snapshot | No |
Total number of live data files?
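If "total" is read as the count of live data files, one way to picture it is as a running total carried forward from the parent snapshot. The Python sketch below illustrates that reading only; it is an assumption for discussion rather than wording from the spec, and `updated_total` and the sample numbers are hypothetical.

```python
def updated_total(previous_total: int, added: int, deleted: int) -> int:
    """Live file count after a commit, given the parent snapshot's total."""
    total = previous_total + added - deleted
    if total < 0:
        raise ValueError("more files deleted than were live")
    return total


# Hypothetical parent total and per-snapshot counts, parsed from numeric strings.
parent_total = int("100")
summary = {"added-data-files": "5", "deleted-data-files": "2"}

# Write the result back as a numeric string, like the other metrics.
summary["total-data-files"] = str(
    updated_total(
        parent_total,
        int(summary["added-data-files"]),
        int(summary["deleted-data-files"]),
    )
)
```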
format/spec.md
Outdated
| **`added-data-files`** | Number of data files added in the current snapshot | Yes |
| **`deleted-data-files`** | Number of data files deleted in the current snapshot | Yes |
| **`total-data-files`** | Total number of data files in the current snapshot | No |
| **`added-delete-files`** | Number of delete files added in the current snapshot | Yes |
Is this a combination of position and equality deletes?
format/spec.md
Outdated
| **`added-dvs`** | Number of deletion vectors added in the current snapshot | Yes |
| **`removed-dvs`** | Number of deletion vectors removed in the current snapshot | Yes |
| **`removed-delete-files`** | Number of delete files removed in the current snapshot | Yes |
| **`total-delete-files`** | Total number of delete files in the current snapshot | No |
This is a combination of dv, df, and eq deletes, right?
format/spec.md
Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes |
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
Unclear where duplicate files are coming from
format/spec.md
Outdated
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
| **`changed-partition-count`** | Number of partitions changed in the current snapshot | No |
I think we need more detail here: does this mean partitions with added, deleted, or modified files?
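One possible reading of `changed-partition-count`, sketched in Python: count each distinct partition that had at least one file added or deleted in the snapshot. This is an assumption offered for discussion, not the definition in the spec; `FileChange` and `changed_partition_count` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FileChange:
    partition: Tuple  # partition tuple, e.g. ("2024-01-01",)
    kind: str         # "added" or "deleted"


def changed_partition_count(changes: Iterable[FileChange]) -> int:
    # A partition counts once, however many files in it were added or deleted.
    return len({c.partition for c in changes})


changes = [
    FileChange(("2024-01-01",), "added"),
    FileChange(("2024-01-01",), "deleted"),
    FileChange(("2024-01-02",), "added"),
]
count = changed_partition_count(changes)
summary_value = str(count)  # metrics are written as numeric strings
```

Under this reading a partition that is rewritten in place (one file deleted, one added) counts once, which is the case the wording would need to spell out.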
This PR introduces a new section, "Snapshot Summary", in the table spec under Snapshots to document optional fields in the snapshot summary, including metrics, partition-level summaries, and other fields such as Write-Audit-Publish (WAP)-related fields and ReplacePartitions indicators. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.
Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
Marked as Draft because this is subject to change based on discussion on the dev list.