Spec: Document Snapshot Summary Optional Fields for Standardization #11660

Draft
wants to merge 5 commits into
base: main


HonahX (Contributor) commented Nov 26, 2024

This PR introduces a new section, "Snapshot Summary", in the table spec under Snapshots to document optional fields in the snapshot summary, including metrics, partition-level summaries, and other fields such as Write-Audit-Publish (WAP)-related fields and ReplacePartitions indicators. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.

Proposal: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing

Marked as Draft because this is subject to change based on discussion on the dev list.

github-actions bot added the Specification label (Issues that may introduce spec changes.) Nov 26, 2024
HonahX linked an issue Nov 26, 2024 that may be closed by this pull request
format/spec.md Outdated
* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.

##### Optional Metrics
All metrics fields should have numeric string values (e.g., `"120"`).
Member

I think we should note that engines may use these values for optimizations, so they must be correct, but an engine can skip them if it doesn't want to write them. Maybe that's already clear enough with "optional", though.
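To make the "numeric string" requirement concrete, here is a minimal sketch of a check a writer or reader might run over a summary map. The field names come from the table in this PR; the helper name `invalid_metrics`, the subset of fields checked, and the choice to accept only non-negative integer strings are assumptions for illustration, not something the spec text states.

```python
from typing import List, Mapping

# Subset of the metric fields listed in this PR (assumed sufficient for the example).
METRIC_FIELDS = {
    "added-data-files",
    "deleted-data-files",
    "total-data-files",
    "added-delete-files",
    "removed-delete-files",
    "total-delete-files",
}


def invalid_metrics(summary: Mapping[str, object]) -> List[str]:
    """Return metric fields whose values are not non-negative integer strings."""
    bad = []
    for field in sorted(summary.keys() & METRIC_FIELDS):
        value = summary[field]
        # Accept only strings of ASCII decimal digits, e.g. "120".
        if not (isinstance(value, str) and value.isascii() and value.isdigit()):
            bad.append(field)
    return bad


summary = {
    "operation": "append",       # not a metric, so it is ignored
    "added-data-files": "120",   # valid numeric string
    "total-data-files": "12o",   # typo: letter "o" instead of zero
}
print(invalid_metrics(summary))  # ['total-data-files']
```

Because the fields are optional, a missing metric is not reported; only a present but malformed value is.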

format/spec.md Outdated
* `overwrite` -- Data and delete files were added and removed in a logical overwrite operation.
* `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.

##### Optional Metrics
Member

As in the proposal doc, I think this would make a nice appendix table.

format/spec.md Outdated
Some of them are also used to represent partition-level metrics, in [Optional Partition-Level Summary](#optional-partition-level-summary).

| Field | Description | Used in Partition-Level Summary |
|-------------------------------------|-------------------------------------------------------------------|---------------------------------|
Member

I'm not sure "current" is required here.

format/spec.md Outdated
|-------------------------------------|-------------------------------------------------------------------|---------------------------------|
| **`added-data-files`** | Number of data files added in the current snapshot | Yes |
| **`deleted-data-files`** | Number of data files deleted in the current snapshot | Yes |
| **`total-data-files`** | Total number of data files in the current snapshot | No |
Member

Total number of live data files?
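One way to read "live" here is as a running total carried from snapshot to snapshot. The sketch below assumes that interpretation: the new total is the previous snapshot's total plus the files added minus the files deleted. The function name `updated_total` and the rule itself are assumptions for illustration; the PR text does not define how the total is derived.

```python
from typing import Mapping, Optional


def updated_total(
    previous: Mapping[str, str],
    current: Mapping[str, str],
    total_key: str,
    added_key: str,
    removed_key: str,
) -> Optional[str]:
    """Derive a running total, or None when the previous total is unknown."""
    if total_key not in previous:
        # Without a trustworthy starting point, skip the optional field
        # rather than write a value that might be wrong.
        return None
    total = int(previous[total_key])
    total += int(current.get(added_key, "0"))
    total -= int(current.get(removed_key, "0"))
    return str(total)  # metrics are written as numeric strings


previous = {"total-data-files": "10"}
current = {"added-data-files": "3", "deleted-data-files": "1"}
print(updated_total(previous, current,
                    "total-data-files", "added-data-files", "deleted-data-files"))  # 12
```

Returning `None` when the previous total is absent reflects the earlier point that optional values should be omitted rather than written incorrectly.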

format/spec.md Outdated
| **`added-data-files`** | Number of data files added in the current snapshot | Yes |
| **`deleted-data-files`** | Number of data files deleted in the current snapshot | Yes |
| **`total-data-files`** | Total number of data files in the current snapshot | No |
| **`added-delete-files`** | Number of delete files added in the current snapshot | Yes |
Member

Is this a combination of position and equality deletes?

format/spec.md Outdated
| **`added-dvs`** | Number of deletion vectors added in the current snapshot | Yes |
| **`removed-dvs`** | Number of deletion vectors removed in the current snapshot | Yes |
| **`removed-delete-files`** | Number of delete files removed in the current snapshot | Yes |
| **`total-delete-files`** | Total number of delete files in the current snapshot | No |
Member

This is a combination of dv, df, and eq deletes, right?

format/spec.md Outdated
| **`added-equality-deletes`** | Number of equality delete records added in the current snapshot | Yes |
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
Member

It's unclear where the duplicate files are coming from.

format/spec.md Outdated
| **`removed-equality-deletes`** | Number of equality delete records removed in the current snapshot | Yes |
| **`total-equality-deletes`** | Total number of equality delete records in the current snapshot | No |
| **`deleted-duplicate-files`** | Number of duplicate files deleted in the current snapshot | No |
| **`changed-partition-count`** | Number of partitions changed in the current snapshot | No |
Member

I think we need more detail here: does this mean partitions with added, deleted, or modified files?
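As a sketch of one possible definition, the count below treats a partition as "changed" if at least one file was added to it or deleted from it in the snapshot. This is only an assumed interpretation to make the question concrete; the function name, the `(partition, file)` tuple shape, and the sample paths are hypothetical.

```python
from typing import Iterable, Tuple

FileEntry = Tuple[str, str]  # (partition path, file name), a hypothetical shape


def changed_partition_count(
    added_files: Iterable[FileEntry],
    deleted_files: Iterable[FileEntry],
) -> int:
    """Count distinct partitions with at least one added or deleted file."""
    touched = {partition for partition, _ in added_files}
    touched |= {partition for partition, _ in deleted_files}
    return len(touched)


added = [("day=2024-11-25", "a.parquet"), ("day=2024-11-26", "b.parquet")]
deleted = [("day=2024-11-26", "c.parquet")]
print(changed_partition_count(added, deleted))  # 2
```

Under this reading, a partition that has both an added and a deleted file is counted once.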

Development

Successfully merging this pull request may close these issues.

Document Snapshot Summary Optional Fields for Standardization
2 participants