Skip to content

Data Schema

Ben Murray edited this page Mar 5, 2021 · 21 revisions

Data schema definition

As CSV files don't provide a first-class way of carrying type information, this must somehow be determined so that it can be stored and processed efficiently. There are several approaches that typically get used to achieve this.

  • Discovering the schema from the dataset
  • Specifying the schema

Discovering the schema from the dataset

This method involves parsing the data to understand what data types may be present. Typically, it involves falling back to the most specialised format that fits the data. This is complicated by a couple of factors though. The first complication is that a csv file may have fields that contain a mixture of empty entries and typed values. In this case, we may want to convert this single field into two fields, one that is strongly typed for the values that are present, and one indicating whether values are present. The second complication is that very large files impose a significant upfront cost during schema discovery, and we wish to avoid that if possible. The third complication is that some string fields contain only a few categorical values, and we need to identify when this is the case and convert the field to integer values with an accompanying key. What we cannot do however, is recover any implicit ordering to the categories.

Specifying the schema

A manual specification of the schema may be more time-consuming, but it allows full control over the resulting fields. A hybrid approach is of course possible, where schema discovery is performed and then manually corrected, but this is not implemented for ExeTera at present.

The Schema

The ExeTera schema is a json file containing a description of each group and the fields that make up the groups.

Top level tags

ExeTera expects two top-level tags. The first indicates that this is an ExeTera schema file, and the second is a schema tag.

{
    "exetera": {
        "version": "1.0.0"
    },
    "schema": {
        ...
    }
}

Note that older versions of the schema can have hystore (the old name for exetera) instead of exetera.

Schema tag

The schema tag contains a set of group tags that indicate the name of each group

{
    ...
    "schema": {
        <table1>: ...,
        <table2>: ...,
        <table_blah>: ...
    }
}

Each entry contains a dictionary conforming to the table schema element.

Table tag

Each table entry describes a single table, the relationships between them, and descriptions for the fields within them.

<a_table>: {
  "primary_keys": ...,
  "foreign_keys": ...,
  "fields": ...
}

Table: primary key tag

Primary keys is a list of fields in the table that together uniquely identify each row:

  • This tag can be absent if the table has no primary key.
  • If there is a single primary key, the tag contains a single string corresponding to the name of the field
  • If the key is a compound key, the tag contains an array which itself contains a list of strings
"primary_keys": <foo>

"primary_keys": [<foo>, <bar>]

Table: foreign keys tag

Foreign keys is a list of fields in the table that act as foreign keys into other tables. Each foreign key can be a foreign key into multiple tables. For each such table, the tag is the name of the table mapping to a si

"foreign_keys": {
    <a_table>: {
        <key1>: <a_table_key1>,
        <key2>: <a_table_key2>
    },
    <another_table> {
        <key1>: <another_table_key1>
    }
}

Fields

The field tag contains field specific formation information. Each field description maps to one or more fields in the resulting dataset. The contents of the field tag depend on the type of field in question.

Field types

The field_type tag can take any of the following values:

  • "string"
  • "fixed_string"
  • "numeric"
  • "categorical"
  • "datetime"
  • "date"

String field

String fields import the incoming data as indexed string fields. They take no additional tags

<a_string_field>: {
    "field_type": "string"
}

Fixed string field

Fixed string fields import the incoming data as fixed-length strings (i.e. strings that can be up to a certain number of characters long). They take an additional length tag indicating how long that should be. Note that the data stored in fixed-string fields is stored as byte arrays and so the length must be the number of bytes that the longest string takes up, rather than the number of characters.

<a_fixed_string_field>: {
    "field_type": "fixed_string",
    "length": 8
}

Numeric field

Numeric fields import the incoming data by converting it to the given type, if such a conversion is possible. Numeric fields may contain the following additional keys:

  • raw_type: if "raw_type" is present, then this is the format that the expected data format. This field is usually not required, and is there for fields that are integer values but encoded as floating point values
  • value_type: "value_type" is always required and must take one of the following values:
    • 'float32', 'float64', 'bool', 'int8', 'uint8', 'int16', 'uint16', 'int32', 'uint32', 'int64'

Numeric fields output two fields, one field that contains the imported values, and a second field that is the name of the first field plus _valid, that is a boolean field containing whether the corresponding numeric field entry is valid or not.

<a_numeric_field>: {
    "field_type": "numeric",
    "value_type": "int32"
}

Categorical field

Categorical fields map between a set of string values that represent categorical choices and integer values to represent them more compactly. Categorical fields always have a "categorical" tag that contains:

  • value_type: the numerical type for the numerical categorical values (typically "int8")
  • strings_to_values: the mapping between strings and their corresponding categorical values
  • out_of_range: an optional tag that indicates that there are non-categorical values mixed in with the categorical values and provides a suffix to add to the field name for the out of range values

Note, that when "out_of_range" is used, a separate, indexed-string field is created for those values. The value -1 is used for out of category values in the main field.

<a_categorical_field>: {
    "field_type": "categorical",
    "categorical": {
        "value_type": "int8",
        "strings_to_values": {
            "": 0,
            "not_applicable": 1,
            "mild": 2,
            "medium": 3,
            "severe": 4
        },
        "out_of_range": "free_text"
    }
}

DateTime field

DateTime fields convert string datetime representations to float64 timestamps and fixed string fields with date resolution (i.e., year, month, day). DateTime fields take optional tag optional which indicates whether the field is expected to have missing entries or not. If this is not present, "optional": "false" is assumed.

<a_datetime_field>: {
    "field_type": "datetime",
    "optional": "False" 
}

DateTime fields are currently limited to importing the following formats:

  • YYYY-MM-DD HH:MM:SS.ffffffz
  • YYYY-MM-DD HH:MM:SSz

DateTime fields create up to two additional fields, along with the float64 timestamp field:

  • <name>_days is a fixed string field that contains the YYYY-MM-DD part of the date time
  • <name>_set is a boolean field indicating whether the datetime is present for a given entry

Date field

The date field essentially identical to the datetime field excepting that it accepts dates in the YYYY-MM-DD format rather than a datetime format. Other than that, it behaves identically to the datetime field.