Concepts

Ben Murray edited this page Sep 2, 2020 · 10 revisions

This page covers the basic concepts behind the pipeline and the design decisions that have gone into it.

HDF5

HDF5 is a hierarchical key/value store: it stores keys together with the data associated with each key. This matters because a dataset can be very large while the data that you want to analyse may be only a small fraction of it. HDF5 allows you to explore the hierarchical collection of fields without having to load them, and it allows you to load specific fields or even part of a field.
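As a rough illustration, the sketch below uses h5py directly (not the pipeline's own API) on an in-memory file. The group and dataset names ('patients', 'age') are invented for the example.

```python
import h5py
import numpy as np

# In-memory HDF5 file, so the sketch leaves nothing on disk.
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    # Hypothetical layout: one group holding one dataset.
    group = f.create_group('patients')
    group.create_dataset('age', data=np.arange(1000, dtype=np.int32))

    # Walking the hierarchy only touches names and metadata.
    names = []
    f.visit(names.append)

    dset = f['patients/age']
    shape = dset.shape  # metadata only, no field data is read

    # Slicing reads just the requested part of the dataset into memory.
    part = dset[10:15]

print(names)
print(shape, part)
```

Only the five requested elements are read from the dataset; the rest of the field is never loaded.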

Fields

Although we can load part of a field, which allows us to perform some types of processing on arbitrarily large fields, the native performance of HDF5 field iteration is very poor. Much of the functionality of the pipeline is therefore dedicated to providing scalability without sacrificing performance.
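One common way to get both properties is to read a field in fixed-size chunks, so that memory use stays bounded while each read is a bulk operation rather than a per-element one. The following is a minimal sketch of that idea, not the pipeline's implementation; `chunked_sum` and `chunk_size` are hypothetical names.

```python
def chunked_sum(field, chunk_size=1 << 20):
    """Sum a sliceable sequence one chunk at a time.

    `field` can be anything that supports len() and slicing, such as a
    list, a numpy array or an h5py dataset. Only one chunk is held in
    memory at any time.
    """
    total = 0
    for start in range(0, len(field), chunk_size):
        chunk = field[start:start + chunk_size]  # one bulk read per chunk
        total += sum(chunk)
    return total


print(chunked_sum(list(range(10)), chunk_size=4))  # 45
```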

Fields have another purpose, which is to carry useful metadata along with the field data itself, and to hide the complexity of storing certain datatypes efficiently.

Datatypes

The pipeline has the following datatypes that can be interacted with through Fields:

Indexed string

Indexed strings exist to provide a compact format for storing variable length strings in HDF5. Python / HDF5 through h5py doesn't support efficient string storage, so we convert Python strings to indexed strings before storing them, resulting in an orders of magnitude smaller representation in some cases. Indexed strings are composed of two elements: a uint8 'values' array containing the byte data of all the strings concatenated together, and an 'index' array indicating where each entry starts and ends in the 'values' array. Entry i occupies the bytes from index[i] up to, but not including, index[i+1].

Example: Take the following string list

['The','quick','brown','fox','jumps','over','the','','lazy','','dog']

This is serialised as follows:

values = [Thequickbrownfoxjumpsoverthelazydog]
index = [0,3,8,13,16,21,25,28,28,32,32,35]

Note that empty strings are stored very efficiently, as they don't require any space in the 'values' array.
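The serialisation shown above can be reproduced in a few lines of plain Python. This is a sketch of the layout only: the helper names `to_indexed` and `from_indexed` are hypothetical, and it uses `bytes` and a list where the pipeline uses uint8 and index arrays.

```python
def to_indexed(strings):
    """Encode a list of str as (values, index) in the layout described above."""
    values = bytearray()
    index = [0]
    for s in strings:
        values.extend(s.encode('utf-8'))
        index.append(len(values))  # end of this entry == start of the next
    return bytes(values), index


def from_indexed(values, index):
    """Rebuild the list of str from (values, index)."""
    return [values[index[i]:index[i + 1]].decode('utf-8')
            for i in range(len(index) - 1)]


words = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', '', 'lazy', '', 'dog']
values, index = to_indexed(words)
print(values)
print(index)
assert from_indexed(values, index) == words
```

An empty string shows up only as a repeated offset in the index, which is why it takes no space in the values.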

UTF8

UTF8 strings are encoded into byte arrays before being stored, and decoded back into Python strings when read.
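One consequence worth keeping in mind is that stored lengths and offsets count bytes, not characters. A short sketch using only standard Python string methods:

```python
# 'naïve café', written with escapes so the code points are unambiguous.
text = 'na\u00efve caf\u00e9'
encoded = text.encode('utf-8')

# 10 characters, but 12 bytes: each accented letter takes two bytes in UTF8.
print(len(text), len(encoded))

# Decoding restores the original string exactly.
assert encoded.decode('utf-8') == text
```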

Fixed string

Fixed string fields store each entry as a fixed length byte array. Entries cannot be longer than the number of bytes specified. TODO: encoding / decoding and UTF8

Numeric

Numeric fields are just that: arrays of a given numeric type. Any primitive numeric type is supported, although use of uint64 is discouraged, as this library is heavily reliant on numpy and numpy does unexpected things with uint64 values. For example, combining a uint64 with a signed integer promotes the result to float64:

import numpy as np

a = np.uint64(1)
b = a + np.int64(1)
print(type(b))
# <class 'numpy.float64'>

Categorical

Categorical fields are fields where only a certain set of values is permitted. The values are stored as an array of uint8 values, and mapped to human readable values through the 'key' field.
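A minimal sketch of the idea using numpy directly. The key contents, and the choice to map labels to codes rather than codes to labels, are assumptions made for illustration and may not match how the pipeline lays out its 'key' field.

```python
import numpy as np

# Hypothetical key: human readable label -> stored uint8 code.
key = {'no': 0, 'yes': 1, 'unknown': 2}
values = np.array([1, 0, 0, 2, 1], dtype=np.uint8)

# Invert the key to turn stored codes back into labels.
labels_by_code = {code: label for label, code in key.items()}
labels = [labels_by_code[int(v)] for v in values]
print(labels)

# Filtering works on the compact codes, with no string comparison.
is_yes = values == key['yes']
print(int(is_yes.sum()))
```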

Timestamp

Timestamp fields are arrays of float64 POSIX timestamp values. These can be mapped to and from datetime fields when performing complex operations. The decision to store dates and datetimes this way is primarily one of performance. It is very quick to check whether millions of timestamps are before or after a given point in time by converting that point in time to a POSIX timestamp and performing a fast floating point comparison.
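The comparison described here might look like the following sketch, which uses numpy and the standard datetime module rather than the pipeline's API. The sample dates are invented, and UTC is assumed throughout so the result doesn't depend on the local timezone.

```python
from datetime import datetime, timezone

import numpy as np

# Hypothetical field contents: float64 POSIX timestamps (seconds since the epoch).
timestamps = np.array([
    datetime(2020, 3, 1, tzinfo=timezone.utc).timestamp(),
    datetime(2020, 6, 15, tzinfo=timezone.utc).timestamp(),
    datetime(2020, 9, 2, tzinfo=timezone.utc).timestamp(),
], dtype=np.float64)

# Convert the cut-off once, then compare the whole array in one vectorised step.
cutoff = datetime(2020, 6, 1, tzinfo=timezone.utc).timestamp()
before = timestamps < cutoff
print(before.tolist())

# Mapping a single value back to a datetime.
first = datetime.fromtimestamp(float(timestamps[0]), tz=timezone.utc)
print(first)
```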

Operations

Reading from Fields

Fields don't read any of the field data from storage until the user explicitly requests it. The user does this by performing array dereference on a field's data property:

r = session.get(dataset['foo'])
rvalues = r.data[:]

This reads the whole of a given field from the dataset.

Writing to fields

Fields are written to in one of three ways:

  • one or more calls to write_part, followed by flush
  • a single call to write
  • writing to the data member, if overwriting existing contents but maintaining the field length

When using write_part:

w = session.create_numeric(dataset, 'foo', 'int32')
for p in parts_from_somewhere:
    w.write_part(p)
w.flush()

When using write:

w = session.create_numeric(dataset, 'foo', 'int32')
w.write(data_from_somewhere)

Fields are marked completed upon flush or write. This is the last action that is taken when writing, and indicates that the operation was successfully completed.
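The value of setting the completed marker last is that an interrupted write can be detected afterwards. The toy class below sketches that pattern only; `SketchWriter` is a hypothetical stand-in and does not reflect how the pipeline's writers are implemented.

```python
class SketchWriter:
    """Toy stand-in for a field writer; not the pipeline's implementation."""

    def __init__(self):
        self.parts = []         # chunks written so far
        self.completed = False  # stays False until the final step succeeds

    def write_part(self, part):
        self.parts.append(list(part))

    def flush(self):
        # Marking the field completed is the very last action, so a
        # failure anywhere earlier leaves completed == False.
        self.completed = True

    def write(self, data):
        self.write_part(data)
        self.flush()


w = SketchWriter()
w.write_part([1, 2])
w.write_part([3])
print(w.completed)  # False: flush has not been called yet
w.flush()
print(w.completed)  # True
```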
