
Commit

Merge pull request #2 from jiayuasu/geo_spec_draft
Add more explanation
szehon-ho authored Dec 4, 2024
2 parents 4c4a314 + cdd6eb2 commit b7f8a33
Showing 1 changed file with 7 additions and 4 deletions.
11 changes: 7 additions & 4 deletions format/spec.md
@@ -212,7 +212,7 @@ Notes:
1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
3. Character strings must be stored as UTF-8 encoded byte arrays.
- 4. CRS (coordinate reference system) is a mapping of how coordinates refer to locations on earth. A custom crs can be specified by a string, which is a table property whose value is the crs representation, and with an additional '.type' suffix is optionally another table property whose value describes the representation's encoding. If this field is null (no custom CRS provided), CRS defaults to OGC:CRS84, which means the data must be stored in longitude, latitude based on the WGS84 datum. Fixed and cannot be changed by schema evolution.
+ 4. CRS (coordinate reference system) is a mapping of how coordinates refer to locations on earth. A custom crs can be specified by a string, which is a table property whose value is the crs representation, and with an additional '.type' suffix is optionally another table property whose value describes the representation's encoding. If this field is null (no custom CRS provided), CRS defaults to OGC:CRS84. Fixed and cannot be changed by schema evolution.

For details on how to serialize a schema to JSON, see Appendix C.

@@ -586,8 +586,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
| _optional_ | _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
- | _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For `geometry` types, see [7]. |
- | _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] For `geometry` type, see [8] |
+ | _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For `geometry` types, see [7,9]. |
+ | _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] For `geometry` type, see [8,9] |
| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
| _optional_ | _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
| | _optional_ | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
@@ -607,6 +607,7 @@ Notes:
6. The following field ids are reserved on `data_file`: 141.
7. `geometry`, this is a point: X = westernmost bound of all geometries in file, Y = northernmost bound of all geometries in file, Z is min value for all component points of all geometries in the file, M is min value of all component points of all geometries in the file. See Appendix D for encoding.
8. `geometry`, this is a point: X = easternmost bound of all geometries in file, Y = southernmost bound of all geometries in file, Z is max value for all component points of all geometries in the file, M is max value of all component points of all geometries in the file. See Appendix D for encoding.
+ 9. `geometry`, the concepts of westernmost and easternmost values are explicitly introduced to address cases involving antimeridian crossing, where the `lower_bound` may be greater than `upper_bound`.
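
Note 9 implies that a reader cannot treat the X bounds as an ordinary `min <= x <= max` interval. The following is a minimal sketch of a containment check that tolerates a westernmost value greater than the easternmost one. It assumes longitudes in degrees within [-180, 180] (as under the default OGC:CRS84); the function name `lon_in_bounds` is hypothetical and is not part of the spec or of any Iceberg API.

```python
def lon_in_bounds(lon: float, west: float, east: float) -> bool:
    """Return True if `lon` lies in the interval running eastward from `west` to `east`.

    Assumption: all values are longitudes in degrees within [-180, 180].
    """
    if west <= east:
        # Ordinary case: the interval does not cross the antimeridian.
        return west <= lon <= east
    # west > east: the interval wraps past +180 and continues from -180.
    return lon >= west or lon <= east
```

For example, with a westernmost bound of 170 and an easternmost bound of -170, a longitude of 179.5 is inside the interval while 0 is not.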

The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.

@@ -1642,6 +1643,8 @@ When processing point in time queries implementations should use "snapshot-log"

## Appendix G: Geospatial Notes

- The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for Geographic information – Simple feature access – Part 1: Common architecture](https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial Consortium)](https://www.ogc.org/standard/sfa/).
+ The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for Geographic information – Simple feature access – Part 1: Common architecture](https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial Consortium)](https://www.ogc.org/standard/sfa/). According to the OGC specification, all geometric attributes are described piecewise by straight line or planar interpolation between sets of points.

+ The version of the OGC standard first used here is 1.2.1, but future versions may also be used if the WKB representation remains wire-compatible.

+ Coordinate axis order is always (x, y) where x is easting or longitude, and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS.
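
To illustrate the (x, y) axis order, here is a small sketch that encodes a 2D point in the standard OGC WKB layout (byte-order flag, geometry type code 1 for Point, then two 64-bit floats). It uses only Python's `struct` module; the helper name `encode_wkb_point_xy` is hypothetical, and the sketch does not claim to reproduce the bound encoding described in Appendix D.

```python
import struct


def encode_wkb_point_xy(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    x is easting/longitude and y is northing/latitude, in that order,
    regardless of the axis order declared by the CRS.
    """
    byte_order = 1      # 1 = little-endian (NDR)
    geometry_type = 1   # 1 = Point (XY)
    # "<" disables padding: 1 + 4 + 8 + 8 = 21 bytes in total.
    return struct.pack("<BIdd", byte_order, geometry_type, x, y)


# Longitude first, then latitude.
wkb = encode_wkb_point_xy(-122.4194, 37.7749)
```

Decoding with `struct.unpack("<BIdd", wkb)` returns the same values in the same order, so a writer that passes latitude first would silently produce a different location.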
