
Refactors


Refactoring ExeTera

Out of date page!!!

This page is gradually becoming out of date, and the reader should refer to Roadmap instead

A number of high-level, big-impact changes need to be made to the ExeTera codebase. These are required to make the ExeTera API more generalisable and easier to use, and to allow upcoming changes to be carried out with less impact, both in terms of code changes and in terms of changes to the API.

The following major refactors are planned:

  • Move all covid-specific code out of ExeTera and into ExeTeraCovid
  • Move all data access inside the Session object
  • Provide rich functionality to groups and fields
  • Move ExeTera away from HDF5 as a storage mechanism
    • Provide file-system-based datastore
    • Provide server-based datastore

Replacing HDF5 as a Data Store

""" This is the DataStore intended to replace the usage of the hdf5 datastore.

  • Data is stored in numpy npy/npz files.
  • Encodings are stored with data.
  • Metadata is stored in a json file at the top-level folder (this layout is sketched below)
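As an illustration only, the following sketch shows what writing a single field under this layout might look like; the datastore root, table and field names, and the exact file naming are assumptions rather than a decided format.

import json
import os
import numpy as np

# Illustrative only: hypothetical datastore root and table folder.
root = "mydatastore"
os.makedirs(f"{root}/patients", exist_ok=True)

# Field data is stored as an npy/npz file; for a categorical field the
# encoding is stored alongside the data in the same npz archive.
np.savez(f"{root}/patients/has_fever.npz",
         values=np.asarray([0, 1, 1, 0], dtype=np.int8),
         keys=np.asarray(["no", "yes"]))

# Field metadata is recorded in a json document at the top-level folder.
metadata = {"patients/has_fever": {"fieldtype": "categorical",
                                   "keys": {"no": 0, "yes": 1}}}
with open(f"{root}/metadata.json", "w") as f:
    json.dump(metadata, f)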

DataStore metadata schema

  • options
    • centralised json document
      • store all metadata in the same document. This requires that any write to a field is reflected in the central metadata document before being considered complete
    • decentralised json fragments
      • store each metadata item with each field. This requires that all metadata is gathered from the datastore directories as part of loading. Both options are sketched after this list
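To make the trade-off concrete, here is a sketch of the two shapes the metadata could take; the field names and attributes are hypothetical.

# Option 1: centralised json document at the top level of the datastore.
# Any write to a field must be reflected here before it is considered
# complete.
centralised = {
    "fields": {
        "patients/age":       {"fieldtype": "numeric", "dtype": "int32"},
        "patients/has_fever": {"fieldtype": "categorical",
                               "keys": {"no": 0, "yes": 1}},
    }
}

# Option 2: decentralised json fragments, one stored with each field's
# data. Loading the datastore means walking the directories and
# gathering every fragment.
fragment_for_age = {"fieldtype": "numeric", "dtype": "int32"}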

DataStore field to directory mapping

  • options
    • direct mapping with folders
      • folders represent tables at the top level
    • indirect mapping with folders
      • json metadata schema includes a mapping from each field to its stored location; fields are stored so as to optimise file system usage (both mappings are sketched after this list)
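The two mapping options can be illustrated as follows; the paths and field names are hypothetical.

# Direct mapping: the logical name is the physical location, and folders
# at the top level represent tables.
#   patients/age  ->  <root>/patients/age.npy

# Indirect mapping: the json metadata carries a logical-to-physical
# mapping, so fields can be laid out to suit the file system (for
# example, to avoid very large directories).
mapping = {
    "patients/age":       "data/000/000017",
    "patients/has_fever": "data/000/000018",
}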

DataStore concurrency

In general, the DataStore is not designed to be written to (or read from) by multiple users. There is no system being proposed to allow multiple users to treat it like a concurrent access database, although some provision is made for the notion of coordinated reads / writes from multiple threads.

  • options
    • file-based locking:
      • use a file-based locking mechanism to "lock" datasets for writing. This approach may have issues because file locks are OS-specific; although libraries such as https://pypi.org/project/filelock/ exist to facilitate this, there is no 'standard' library in the ecosystem for it. A minimal sketch follows this list
    • cross-process synchronisation:
      • use a library based on OS-synchronisation primitives
        • asyncio
        • multiprocessing
        • dask.distributed (uses asyncio)
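As an example of the file-based locking option, a minimal sketch using the filelock package linked above; the lock-file name and the write it protects are assumptions.

import json
from filelock import FileLock

# Guard writes to the (hypothetical) central metadata document with a
# file lock so that two processes cannot update it at the same time.
lock = FileLock("mydatastore/metadata.json.lock", timeout=10)

with lock:
    with open("mydatastore/metadata.json", "w") as f:
        json.dump({"fields": {}}, f)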

Initial design thinking

Concurrency: initially, don't synchronise; assume a single process is reading from / writing to the data.

Logical <-> directory mapping: this can be implemented immediately, or it can be left out of the initial implementation. The primary concern should be ease of the upgrade path. The absence of a logical mapping tag can be taken to mean direct mapping, or an identity logical mapping tag can be provided; the latter is suggested as the best approach.

Metadata schema: a centralised schema can be tried in the first instance. Decentralising the schema subsequently should be a simple operation if it is required.

The primary consideration is how to keep the serialisation up to date. Changes to group contents should always be reflected in the data store schema file, whether the schema file elements are scattered or centralised. For scattered elements this is relatively easy, as each element can be written to the appropriate location at the appropriate point in time (i.e. when a field is created). When centralised, it is more complicated: the schema serialisation must either be constantly updated and saved, or constantly updated but saved before leaving exception handling, or created and serialised before leaving exception handling. All of these things can be done, but they require use of the signal module to do properly.
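As an illustration of that last point, a minimal sketch of flushing a centralised schema document on exit or interruption, using the signal module as suggested; the in-memory schema and file locations are hypothetical.

import atexit
import json
import signal
import sys

schema = {"fields": {}}  # hypothetical in-memory copy of the central schema

def flush_metadata():
    # Write the current in-memory schema to the datastore's top-level folder.
    with open("mydatastore/metadata.json", "w") as f:
        json.dump(schema, f)

def handle_termination(signum, frame):
    flush_metadata()
    sys.exit(1)

# Flush on normal interpreter exit and on SIGINT/SIGTERM so that the
# central document does not fall behind the data on disk.
atexit.register(flush_metadata)
signal.signal(signal.SIGINT, handle_termination)
signal.signal(signal.SIGTERM, handle_termination)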

"""

Provide rich functionality to groups and fields

Problem examples with hypothetical improvements

Apply filter to field

From

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
f = s.get(src["table/field"])
s.apply_filter(flt, f)

To

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
f = src["table/field"]
f.apply_filter(flt)

or

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
src["table/field"].apply_filter(flt)

Apply filter to table/group

From

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
for k, v in src['table'].items():
    s.apply_filter(flt, s.get(v))

To

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
tab = src['table']
tab.apply_filter(flt)

or

s = Session()
src = s.open("my/dataset", "src")
flt = ...  # calculate a filter
src['table'].apply_filter(flt)
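
As a sketch of the direction these examples imply, rather than the planned API itself, a field could hold its own data and apply a filter to itself, and a group could forward the call to each of its fields; the class names and internals below are assumptions.

import numpy as np

class Field:
    # Hypothetical wrapper around a stored field's data.
    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def data(self):
        return self._data

    def apply_filter(self, flt):
        # Keep only the rows where the boolean filter is True and
        # write the result back to this field.
        self._data = self._data[flt]
        return self

class Group:
    # Hypothetical table/group of named fields.
    def __init__(self, fields):
        self._fields = fields

    def __getitem__(self, name):
        return self._fields[name]

    def apply_filter(self, flt):
        # Applying a filter to a group applies it to every field.
        for f in self._fields.values():
            f.apply_filter(flt)
        return self

tab = Group({"age": Field([10, 20, 30]), "height": Field([140, 160, 180])})
tab.apply_filter(np.asarray([True, False, True]))
print(tab["age"].data)  # [10 30]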