Data Schema
As CSV files don't provide a first-class way of carrying type information, the type of each field must be determined by other means so that the data can be stored and processed efficiently. There are two approaches that are typically used to achieve this.
- Discovering the schema from the dataset
- Specifying the schema
Schema discovery involves parsing the data to understand what data types may be present. Typically, it involves selecting the most specialised type that fits the data. This is complicated by three factors, though. The first complication is that a CSV file may have fields that contain a mixture of empty entries and typed values. In this case, we may want to convert the single field into two fields: one that is strongly typed for the values that are present, and one indicating whether each value is present. The second complication is that very large files impose a significant upfront cost during schema discovery, and we wish to avoid that if possible. The third complication is that some string fields contain only a few categorical values, and we need to identify when this is the case and convert the field to integer values with an accompanying key. What we cannot do, however, is recover any implicit ordering of the categories.
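The discovery steps described above can be sketched roughly as follows. This is a minimal illustration only and not ExeTera's implementation: the function names (`infer_field_type`, `split_presence`, `encode_categorical`), the type names, and the `max_categories` threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def infer_field_type(values: List[str]) -> str:
    """Return the most specialised type name that fits every non-empty value."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"  # nothing to infer from, so use the most general type
    # Try the most specialised type first, then progressively more general ones.
    for type_name, parser in (("int", int), ("float", float)):
        try:
            for v in present:
                parser(v)
            return type_name
        except ValueError:
            continue
    return "string"


def split_presence(values: List[str], parser: Callable, default) -> Tuple[list, List[bool]]:
    """Split one field into typed values plus a flag saying whether each value was present."""
    flags = [v != "" for v in values]
    typed = [parser(v) if v != "" else default for v in values]
    return typed, flags


def encode_categorical(values: List[str],
                       max_categories: int = 8) -> Optional[Tuple[List[int], Dict[str, int]]]:
    """Map a string field with few distinct values to integer codes and a key.

    Returns None if the field has more than max_categories distinct values.
    Codes follow order of first appearance, which carries no meaning: any
    implicit ordering of the categories cannot be recovered this way.
    """
    key: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in key:
            if len(key) == max_categories:
                return None
            key[v] = len(key)
        codes.append(key[v])
    return codes, key
```

Note that this sketch reads every value of a field, so it does nothing to address the upfront cost on very large files.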
A manual specification of the schema may be more time-consuming, but it allows full control over the resulting fields. A hybrid approach is of course possible, where schema discovery is performed and then manually corrected, but this is not implemented for ExeTera at present.
The ExeTera schema is a JSON file containing a description of each group and the fields that make up the groups.
ExeTera expects two top-level tags. The first indicates that this is an ExeTera schema file, and the second is a schema tag.
{
"exetera": {
"version": "1.0.0"
},
"schema": {
...
}
}
Note that older versions of the schema can have hystore (the old name for exetera) instead of exetera.
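As a rough illustration of the two top-level tags, a schema file could be checked as below. This is a sketch using only Python's standard `json` module; the helper name `read_schema_section` is hypothetical and is not part of the ExeTera API.

```python
import json


def read_schema_section(text: str) -> dict:
    """Parse schema text and return the contents of its "schema" tag.

    Hypothetical helper for illustration; not ExeTera's own loader.
    """
    doc = json.loads(text)
    # Accept the legacy "hystore" tag as well as the current "exetera" tag.
    if "exetera" not in doc and "hystore" not in doc:
        raise ValueError("missing 'exetera' (or legacy 'hystore') tag")
    if "schema" not in doc:
        raise ValueError("missing 'schema' tag")
    return doc["schema"]
```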
The schema tag contains a set of group tags, each of which gives the name of a group:
{
...
"schema": {
"table1": {
...
},
"table2": {
...
},
"table_blah": {
...
}
}
}
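For example, the group names can be read back from a schema with Python's standard `json` module. In this sketch the group bodies are left as empty placeholders, since their contents are elided above.

```python
import json

# Group bodies are empty placeholders here, purely for illustration.
schema_text = """
{
  "exetera": {"version": "1.0.0"},
  "schema": {
    "table1": {},
    "table2": {}
  }
}
"""

# Each key under the "schema" tag is the name of a group.
group_names = list(json.loads(schema_text)["schema"].keys())
print(group_names)  # ['table1', 'table2']
```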