Nutigeodb format
Nutigeodb files are used for offline geocoding and reverse geocoding. At a high level, they are SQLite database files whose structure is designed to provide reasonably fast geocoding and compact file sizes.
A Nutigeodb database can store various geographic entities, including countries, states, streets and addresses. Each entity may have multiple entity tags associated with it:
Type | Id |
---|---|
Country | 1 |
Region | 2 |
County | 3 |
Locality | 4 |
Neighbourhood | 5 |
Street | 6 |
Postcode | 7 |
Name | 8 |
Housenumber | 9 |
This section lists the tables used by the format.
The metadata table stores both metadata and conversion- or storage-related information. The table contains two fields:
Field | Type |
---|---|
name | TEXT |
value | TEXT |
The following table lists the supported 'name' values:
Name | Description |
---|---|
version | Version of the database, currently 1 |
rank_scale | The scale of the entity ranks, described below |
translation_table | Token translation table, comma-separated key-value pairs ('A:a,B:b') |
bounds | WGS84 bounds encoded as MIN_LON,MIN_LAT,MAX_LON,MAX_LAT |
origin | WGS84 origin point for geometry (encoded as LON,LAT), geometry is stored relative to this |
encoding_precision | The multiplier used when storing coordinates |
quadindex_level | The last zoom level stored in the quadindex. Described below in more detail. |
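As a minimal sketch of how these values might be read, the following Python snippet loads the name/value pairs into a dictionary and parses two of the structured values. The table name `metadata`, the helper names, and the sample values are assumptions made for illustration; the sketch also assumes that keys and values in translation_table never contain ',' or ':' themselves.

```python
import sqlite3


def read_metadata(conn):
    """Return all (name, value) rows as a dict of strings."""
    return dict(conn.execute("SELECT name, value FROM metadata"))


def parse_translation_table(value):
    """Parse 'A:a,B:b' into {'A': 'a', 'B': 'b'}."""
    table = {}
    for pair in value.split(","):
        if not pair:
            continue
        key, _, translated = pair.partition(":")
        table[key] = translated
    return table


def parse_floats(value):
    """Parse comma-separated numbers such as bounds or origin."""
    return tuple(float(part) for part in value.split(","))


# Demo on an in-memory database filled with made-up sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("version", "1"),
        ("translation_table", "A:a,B:b"),
        ("bounds", "24.0,57.5,28.5,59.9"),
    ],
)
meta = read_metadata(conn)
translation = parse_translation_table(meta["translation_table"])
bounds = parse_floats(meta["bounds"])
```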
Entities table stores geographic entities, including addresses, streets and so on. It contains the following fields:
Field | Type | Description |
---|---|---|
id | INTEGER | Unique id for the entity (note: not OSM_id!) |
type | INTEGER | Type (1=country, 2=region, 3=county, 4=neighbourhood, 5=street, 8=POI, 9=address) |
features | BLOB | List of features (ids, geometry) encoded with custom TinyWKB-like encoder, described below |
housenumbers | BLOB | NULL for non-addresses, String with | as a separator for addresses |
quadindex | INTEGER | Special quadtree node id for reverse geocoding. Described below. |
rank | INTEGER | Relative rank (importance) of the entity. The scale is stored as 'rank_scale' in the metadata table |
One database row may include multiple addresses. In that case the number of features and housenumbers must be equal.
The names of entity components are stored in entitynames table, which contains pairs of (entity_id, name_id) values.
The following table structure is used to store all entity names (and localized versions):
Field | Type | Description |
---|---|---|
id | INTEGER | Unique id for the name |
lang | TEXT | Locale/language of the name, can be NULL |
name | TEXT | Name string |
type | INTEGER | Entity types this name refers to |
The relation between tokens (defined next) and names is stored in nametokens table, which contains (name_id, token_id) pairs.
Tokens are sequences of characters used for resolving names. Tokens are normalized (converted to lower case, with diacritical marks dropped) and do not contain any separators or whitespace symbols. The following table structure is used for tokens:
Field | Type | Description |
---|---|---|
id | INTEGER | Unique id for the token |
token | TEXT | Token string value |
idf | REAL | Token Inverse Document Frequency |
typemask | INTEGER | The bitmask of entity types this token refers to |
Here the idf field is calculated as log(totalTokenCount / thisTokenCount), where totalTokenCount is the count of all tokens in all names and thisTokenCount is the number of occurrences of this token in all names. Note that the token stored is the normalized token (converted to lowercase and with symbols translated according to the translation_table stored in metadata).
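The idf formula can be illustrated with a short Python sketch. It assumes the natural logarithm (the base is not stated here) and tokens that are already normalized; `compute_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def compute_idf(names_tokens):
    """Map each token to log(totalTokenCount / thisTokenCount).

    names_tokens is a list of token lists, one list per name.
    """
    counts = Counter(tok for tokens in names_tokens for tok in tokens)
    total = sum(counts.values())  # count of all tokens in all names
    return {tok: math.log(total / n) for tok, n in counts.items()}


# Two names sharing the token "street": 4 tokens in total.
idf = compute_idf([["main", "street"], ["high", "street"]])
```

With this input the frequent token "street" gets a lower idf than the rarer "main" and "high", so rare tokens carry more weight when matching.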
Features (geometry with optional metadata, similar to GeoJSON) are encoded as bytestreams using a base-128 varint encoding similar to Google protobuf. Unicode strings are converted to UTF-8 and stored as a varint byte length followed by the UTF-8 bytes.
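The following Python sketch shows one way to read and write such varints and strings. It assumes the protobuf wire layout (7 payload bits per byte, least-significant group first, high bit set on every byte except the last) and unsigned values only; how negative numbers are represented is not specified here. All function names are hypothetical.

```python
def read_varint(buf, pos=0):
    """Decode an unsigned varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def write_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_string(buf, pos=0):
    """Decode a length-prefixed UTF-8 string; return (text, new_pos)."""
    length, pos = read_varint(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def write_string(text):
    """Encode text as a varint byte length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return write_varint(len(data)) + data
```

Note that the length prefix counts UTF-8 bytes, not characters, so a name with diacritics has a prefix larger than its character count.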
Feature collections are encoded as follows:
Field | Type | Description |
---|---|---|
n | varint | Number of features |
features | Feature*n | List of features |
Each feature is encoded as follows:
Field | Type | Description |
---|---|---|
id | varint | Delta-encoded relative to previous id |
geometry | Geometry | Geometry of the feature |
n | varint | Number of properties |
properties | Property*n | List of properties |
Geometry is encoded as follows:
Field | Type | Description |
---|---|---|
type | varint | 1=Point, 2=MultiPoint, 3=LineString, 4=MultiLineString, 5=Polygon, 6=MultiPolygon, 7=Collection |
coords/rings | ... | Encoding depends on type |
The encoding of coordinates or rings is 'natural': for list types, the number of elements is stored first as a varint, followed by the elements themselves. All coordinates are stored as integers: each component is multiplied by the value of encoding_precision (stored in the metadata table) and then delta-encoded relative to the previous coordinate.
Properties are encoded as (name, value) pairs, where name is a string and each value is encoded as follows:
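The coordinate scheme can be sketched in Python as below. Several details are assumptions, not confirmed by this description: the origin is subtracted before multiplying, values are rounded to the nearest integer, and the first coordinate is a delta from (0, 0). The byte-level form of negative deltas is not covered. All names are hypothetical.

```python
def quantize(points, origin, precision):
    """Convert (lon, lat) pairs to integers relative to origin."""
    ox, oy = origin
    return [
        (round((lon - ox) * precision), round((lat - oy) * precision))
        for lon, lat in points
    ]


def delta_encode(int_points):
    """Replace each point with its difference from the previous point."""
    prev_x, prev_y = 0, 0
    deltas = []
    for x, y in int_points:
        deltas.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: accumulate differences into points."""
    x, y = 0, 0
    points = []
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def dequantize(int_points, origin, precision):
    """Convert integer coordinates back to (lon, lat) pairs."""
    ox, oy = origin
    return [(x / precision + ox, y / precision + oy) for x, y in int_points]


# Sample values chosen for illustration only.
origin = (24.0, 59.0)
precision = 1000000
line = [(24.745, 59.437), (24.746, 59.4365)]
ints = quantize(line, origin, precision)
deltas = delta_encode(ints)
```

Delta encoding keeps most values small for dense geometry, which is what makes the varint representation compact.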
Field | Type | Description |
---|---|---|
type | varint | 0=Null, 1=Bool, 2=Int, 3=Float, 4=String |
data | ... | Encoding depends on type |
Booleans are stored as varints containing either 0 or 1. Ints are stored as varints. Floats are stored using big-endian 32-bit IEEE 754 encoding.
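A property value reader might look like the following Python sketch. It assumes protobuf-style unsigned varints, treats Ints as unsigned because the signed representation is not described here, and assumes Strings use the varint-length-prefixed UTF-8 form described for strings earlier. `read_value` and `_read_varint` are hypothetical names.

```python
import struct


def _read_varint(buf, pos):
    """Decode an unsigned varint; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_value(buf, pos=0):
    """Decode one typed property value; return (value, new_pos)."""
    kind, pos = _read_varint(buf, pos)
    if kind == 0:  # Null carries no data
        return None, pos
    if kind == 1:  # Bool: varint 0 or 1
        flag, pos = _read_varint(buf, pos)
        return bool(flag), pos
    if kind == 2:  # Int: varint (assumed unsigned in this sketch)
        return _read_varint(buf, pos)
    if kind == 3:  # Float: big-endian 32-bit IEEE 754
        (number,) = struct.unpack_from(">f", buf, pos)
        return number, pos + 4
    if kind == 4:  # String: varint byte length, then UTF-8 bytes
        length, pos = _read_varint(buf, pos)
        return buf[pos:pos + length].decode("utf-8"), pos + length
    raise ValueError(f"unknown property value type {kind}")
```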
The geocoding database stores a compact 64-bit spatial index for fast reverse geocoding requests. The space is represented internally as a quadtree up to a fixed level (stored as quadindex_level in metadata). Each quadtree node is encoded as a 64-bit integer as follows:
(((y << zoom) | x) << 5) | zoom
Each geometry is assigned to the smallest node that fully covers the geometry. To find the nearest geometry for a given location point, we first find the smallest node that contains the point and then move upwards through the parent nodes until we reach the root node. At each node visited, we query the database using the quadindex of that node.
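The node encoding and the upward walk can be sketched in Python as follows. The sketch assumes the usual tile pyramid, where the parent of node (x, y, zoom) is (x >> 1, y >> 1, zoom - 1); how a location is mapped to x and y is not described here and is left out. Function names are hypothetical.

```python
def encode_node(x, y, zoom):
    """Pack a quadtree node using (((y << zoom) | x) << 5) | zoom."""
    return (((y << zoom) | x) << 5) | zoom


def decode_node(code):
    """Unpack a node id into (x, y, zoom)."""
    zoom = code & 31  # lowest 5 bits hold the zoom level
    xy = code >> 5
    return xy & ((1 << zoom) - 1), xy >> zoom, zoom


def ancestors(x, y, zoom):
    """Yield node ids from the given node up to and including the root."""
    while True:
        yield encode_node(x, y, zoom)
        if zoom == 0:
            break
        # Assumed parent relation: halve both tile coordinates.
        x, y, zoom = x >> 1, y >> 1, zoom - 1


# Node ids to look up, smallest node first, for tile (5, 3) at zoom 3.
lookup_order = list(ancestors(5, 3, 3))
```

Each id in `lookup_order` would then be matched against the quadindex column of the entities table, one query per node.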