From 4aabc01aea0307a25d6921edfffcf1e7c1188c54 Mon Sep 17 00:00:00 2001 From: Fokko Driesprong Date: Thu, 21 Sep 2023 17:13:27 +0200 Subject: [PATCH 1/4] Spec: Add section on `null_value_counts` --- format/spec.md | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/format/spec.md b/format/spec.md index 01903393f88f..d503af0904b6 100644 --- a/format/spec.md +++ b/format/spec.md @@ -450,6 +450,48 @@ Notes: 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds. 3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. 4. The following field ids are reserved on `data_file`: 141. +5. For nested structures, the null counts are as following: + ##### Struct + ``` + schema { + 1: nested_struct<2: int, 3: boolean> + } + ``` + The following holds true: + ``` + null null_value_counts={1: 1, 2: 0, 3: 0} + struct<1, True> null_value_counts={1: 0, 2: 1, 3: 0} + struct<1, null> null_value_counts={1: 0, 2: 1, 3: 1} + ``` + ##### List + ``` + schema { + 1: list[2: int] + } + ``` + The following holds true: + ``` + null null_value_counts={1: 1, 2: 0} + [1, 2, 3] null_value_counts={1: 0, 2: 0} + [1, null, 3] null_value_counts={1: 0, 2: 1} + [null, null, 3] null_value_counts={1: 0, 2: 2} + ``` + ##### Maps + ``` + schema { + 1: map<2: int, 3: bytes> + } + ``` + The following holds true: + ``` + null null_value_counts={1: 1, 2: 0, 3: 0} + {1: b'', 2: b''} null_value_counts={1: 0, 2: 0, 3: 0} + {1: b'', 2: null} null_value_counts={1: 0, 2: 0, 3: 1} + {1: null, 2: null} null_value_counts={1: 0, 2: 0, 3: 2} + ``` + Map keys can't be null. + + The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec. From fb8ed0c698af1ce5bf7c6aa6cf7b38f606c690dd Mon Sep 17 00:00:00 2001 From: Fokko Driesprong Date: Thu, 21 Sep 2023 18:06:16 +0200 Subject: [PATCH 2/4] Comments from Russel --- format/spec.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/format/spec.md b/format/spec.md index d503af0904b6..faea3a9a470a 100644 --- a/format/spec.md +++ b/format/spec.md @@ -452,6 +452,7 @@ Notes: 4. The following field ids are reserved on `data_file`: 141. 5. For nested structures, the null counts are as following: ##### Struct + Counts are only for explicit nulls in a field. A nested field which is not counted as null if the parent is null. ``` schema { 1: nested_struct<2: int, 3: boolean> @@ -460,10 +461,11 @@ Notes: The following holds true: ``` null null_value_counts={1: 1, 2: 0, 3: 0} - struct<1, True> null_value_counts={1: 0, 2: 1, 3: 0} - struct<1, null> null_value_counts={1: 0, 2: 1, 3: 1} + struct<1, True> null_value_counts={1: 0, 2: 0, 3: 0} + struct<1, null> null_value_counts={1: 0, 2: 0, 3: 1} ``` ##### List + For list types, the number of null elements in the list is counted. If the elements are not counted if the parent is null. ``` schema { 1: list[2: int] @@ -477,6 +479,7 @@ Notes: [null, null, 3] null_value_counts={1: 0, 2: 2} ``` ##### Maps + For map-elements the number of null values is counted within the map. The values are not counted if the parent is null. Keep in mind that map keys can't be null, so the field will be zero. ``` schema { 1: map<2: int, 3: bytes> @@ -489,9 +492,6 @@ Notes: {1: b'', 2: null} null_value_counts={1: 0, 2: 0, 3: 1} {1: null, 2: null} null_value_counts={1: 0, 2: 0, 3: 2} ``` - Map keys can't be null. - - The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec. From b9749f7b78be0b36c2b39bfcd31692393cd9f4ef Mon Sep 17 00:00:00 2001 From: Fokko Driesprong Date: Thu, 21 Sep 2023 18:23:49 +0200 Subject: [PATCH 3/4] Add note on missing field-ids --- format/spec.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/format/spec.md b/format/spec.md index faea3a9a470a..b9e9177b9fdb 100644 --- a/format/spec.md +++ b/format/spec.md @@ -434,7 +434,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo | _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | | _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | | _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | +| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column. If the +null value cannot be correctly determined for a column, the field can remain unpopulated. | | _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | | _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | From e5068c3a2d719e27e87b58a13da241e8512a8f9a Mon Sep 17 00:00:00 2001 From: Fokko Driesprong Date: Tue, 17 Oct 2023 20:55:34 +0200 Subject: [PATCH 4/4] Clarify language Co-authored-by: Ajantha Bhat --- format/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/format/spec.md b/format/spec.md index b9e9177b9fdb..cd23d49efe51 100644 --- a/format/spec.md +++ b/format/spec.md @@ -480,7 +480,7 @@ Notes: [null, null, 3] null_value_counts={1: 0, 2: 2} ``` ##### Maps - For map-elements the number of null values is counted within the map. The values are not counted if the parent is null. Keep in mind that map keys can't be null, so the field will be zero. + For map-elements the number of null values is counted within the map. The values are not counted if the parent is null. Keep in mind that map keys can't be null, so the number of null values will always be zero. ``` schema { 1: map<2: int, 3: bytes>