-
Notifications
You must be signed in to change notification settings - Fork 4
Concepts
This page covers the basic concepts behind the pipeline and the design decisions that have gone into it.
HDF5 is a hierarchic key/value store. This means that it stores pairs of keys and data associated with that key. This is important because a dataset can be very large and the data that you want to perform analysis on can be a very small fraction of that dataset. HDF5 allows you to explore the hierarchic collection of fields without having to load them, and it allows you to load specific fields or even part of a field.
Although we can load part of a field, which allows us to perform some types of processing on arbitrarily large fields, the native performance of HDF5 field iteration is very poor, and so much of the functionality of the pipeline is dedicated towards providing scalability without sacrificing performance.
Fields have another purpose, which is to support useful metadata along with the field data itself, and also to hide the complexity behind storing certain datatypes efficiently
The pipeline has the following datatypes that can be interacted with through Fields
Indexed strings exist to provide a compact format for storing variable length strings in HDF5. Python / HDF5 through h5py
doesn't support efficient string storage and so we convert python strings to indexed strings before storing them, resulting in orders of magnitude smaller representation in some cases.
Indexed strings are composed to two elements, a uint8
'value' array containing the byte data of all the strings concatenated together, and an index array indicating where a given entry starts and ends in the 'value' array.
Example: Take the following string list
['The','quick','brown','fox','jumps','over','the','','lazy','','dog']
This is serialised as follows:
values = [Thequickbrownfoxjumpsoverthelazydog]
index = [0,3,8,13,16,21,25,28,28,32,32,35]
Note that empty strings are stored very efficiently, as they don't require any space in the 'values' array.
UTF8 strings are encoded into byte arrays before being stored. They are decoded back to UTF8 when reconstituted back into strings when read.
Fixed string fields store each entry as a fixed length byte array. Entries cannot be longer than the number of bytes specified. TODO: encoding / decoding and UTF8
Numeric fields are just that, arrays of a given numeric value. Any primitive numeric value is supported, although use of uint64
is discouraged, as this library is heavily reliant on numpy
and numpy
does unexpected things with uint64
values
a = uint64(1)
b = a + 1
print(type(b))
# float64
Categorical fields are fields where only a certain set of values is permitted. The values are stored as an array of uint8
values, and mapped to human readable values through the 'key' field.
Timestamp fields are arrays of float64 posix timestamp values. These can be mapped to and from datetime fields when performing complex operations. The decision to store dates and datetimes this way is primarily one of performance. It is very quick to check whether millions of timestamps are before or after a given point in time by converting that point in time to a posix timestamp and peforming a fast floating point comparison.
Fields don't read any of the field data from storage until the user explicitly requests it. The user does this by performing array dereference on a field's data
property:
r = session.get(dataset['foo'])
rvalues = r.data[:]
This reads the whole of a given field from the dataset.
Fields are written to in one of three ways:
- one or more calls to
write_part
, followed byflush
- a single call to
write
- writing to the data member, if overwriting existing contents but maintaining the field length
w = session.create_numeric(dataset, 'foo', 'int32')
for p in parts_from_somewhere:
w.write_part(p)
w.flush()
When using write
w = session.create_numeric(dataset, 'foo', 'int32')
w.write(data_from_somewhere)
Fields are marked completed upon flush
or write
. This is the last action that is taken when writing, and indicates that the operation was successfully completed.