Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Add section on null_value_counts #8611

Closed
wants to merge 4 commits into from
Closed

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Sep 21, 2023

As mentioned in the community sync, there is some odd behavior on the null-count:

CREATE TABLE ...
TBLPROPERTIES (
    "write.format.default"="orc"
)
          SELECT named_struct('a', 1, 'b', 2) AS some_struct    -- null_value_counts={1: 0, 2: 0, 3: 0}
UNION ALL SELECT named_struct('a', 1, 'b', null) AS some_struct -- null_value_counts={1: 0, 2: 0, 3: 0}
UNION ALL SELECT null AS some_struct                            -- null_value_counts={1: 1, 2: 0, 3: 0}

CREATE TABLE ...
TBLPROPERTIES (
    "write.format.default"="parquet"
)
          SELECT named_struct('a', 1, 'b', 2) AS some_struct    -- null_value_counts={2: 0, 3: 0}
UNION ALL SELECT named_struct('a', 1, 'b', null) AS some_struct -- null_value_counts={2: 0, 3: 1}
UNION ALL SELECT null AS some_struct                            -- null_value_counts={2: 1, 3: 1}

The goal of this PR is to agree on how to count the nulls for nested objects.

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with the spec changes, I do think we should reconsider a patch to rescan footers and rewrite iceberg metadata since I think we will basically have tables which have previously written iceberg metrics with null counts that do not follow this spec? Not a deal breaker for me since we don't actively use these metrics.

@singhpk234
Copy link
Contributor

I do think we should reconsider a patch to rescan footers and rewrite iceberg metadata since I think we will basically have tables which have previously written iceberg metrics with null counts that do not follow this spec

+1 on this there could be engines using these stats

format/spec.md Outdated Show resolved Hide resolved
Co-authored-by: Ajantha Bhat <[email protected]>
@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Oct 17, 2023
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 20, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 27, 2024
@@ -434,7 +434,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
| _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** |
| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) |
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column. If the
null value cannot be correctly determined for a column, the field can remain unpopulated. |
Copy link
Contributor

@zhongyujiang zhongyujiang Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is worth adding a note somewhere to remind people that the null count values collected in the manifest for nested columns currently may be inaccurate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes. stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants