Malachi Schram edited this page Apr 11, 2024 · 4 revisions

This page aims to summarize the key aspects of the Jefferson Lab Data Science Toolkit. Moreover, it will highlight a few rules for writing and implementing a new module.

The composable Data Science Workflow

Nearly every research project conducted at the Jefferson Lab Data Science department involves the following four steps:

  1. Load a data set (e.g. a pandas dataframe)
  2. Prepare the data set (e.g. normalize all values to the range 0 to 1)
  3. Train and evaluate a model (e.g. a Keras classifier)
  4. Run a post analysis (the plots that usually go into your report or publication)
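The four steps above can be sketched as a chain of plain functions. This is only an illustration: the function names, the toy data, and the trivial "model" (the mean of the data) are assumptions made for this example and are not part of the toolkit.

```python
# Hypothetical stand-ins for the four workflow steps.
# None of these names come from the toolkit itself.

def load_data():
    """Step 1: load a data set (here a small hard-coded list of numbers)."""
    return [2.0, 4.0, 6.0, 10.0]


def prepare_data(data):
    """Step 2: min-max scale all values to the range [0, 1].

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]


def train_and_evaluate(data):
    """Step 3: 'train' a trivial model (the mean) and compute its mean squared error."""
    mean = sum(data) / len(data)
    mse = sum((x - mean) ** 2 for x in data) / len(data)
    return {"model": mean, "mse": mse}


def post_analysis(result):
    """Step 4: summarize the result (a text report instead of a plot)."""
    return f"model={result['model']:.3f}, mse={result['mse']:.3f}"


# Information flows from one step to the next, as in the workflow.
prepared = prepare_data(load_data())
report = post_analysis(train_and_evaluate(prepared))
print(report)
```

Because each step only consumes the output of the previous one, any single function can be replaced without touching the others.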

You are most likely working on more than one project, so repeating these four steps every single time becomes cumbersome. This repository was created to avoid that repetitive and redundant work. Shown below is the workflow containing the four steps presented above. The information flows from left to right.

[Figure: jdst_scheme (workflow diagram)]

A unique feature of this workflow is that all components are modules that can be exchanged. For example, suppose you want to analyze a data set with a GAN and, for comparison, with a VAE. Instead of writing two workflows from scratch, you simply run the workflow above twice, each time with a different model module. The rest stays the same.
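A minimal sketch of exchanging only the model module is given here. The two classes are deliberately trivial placeholders standing in for the GAN and the VAE, and `run_workflow`, `train`, and `evaluate` are hypothetical names chosen for this example, not the toolkit's API.

```python
from statistics import mean, median


class MeanModel:
    """Hypothetical stand-in for one model module (think: the GAN)."""

    def train(self, data):
        self.center = mean(data)
        return self

    def evaluate(self, data):
        # Mean absolute deviation from the fitted center.
        return sum(abs(x - self.center) for x in data) / len(data)


class MedianModel:
    """Hypothetical stand-in for a second model module (think: the VAE)."""

    def train(self, data):
        self.center = median(data)
        return self

    def evaluate(self, data):
        return sum(abs(x - self.center) for x in data) / len(data)


def run_workflow(model):
    """Run the same load / prepare / train / analyze chain with any model."""
    data = [0.0, 1.0, 2.0, 9.0]                         # load
    lo, hi = min(data), max(data)
    prepared = [(x - lo) / (hi - lo) for x in data]     # prepare
    score = model.train(prepared).evaluate(prepared)    # train and evaluate
    return f"{type(model).__name__}: mean abs error = {score:.3f}"  # post analysis


# The workflow is run twice; only the model module changes.
reports = [run_workflow(m) for m in (MeanModel(), MedianModel())]
for line in reports:
    print(line)
```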

The Workflow Modules

From now on, the individual workflow components will be called modules. Each module is dedicated to a certain task (e.g. load data, or train a deep language network). Anyone is welcome to write, design or contribute a new module. However, there are certain rules, which are discussed in the following.

The Module Core Classes

Imagine you wrote a computationally expensive and complex data preparation module and you wish to hand it over to your colleagues. They will want to know how to run your module, i.e. which functions to call. This question can be avoided by agreeing on core functions that EVERY data preparation module has to have. In the above case, the function that runs the data preparation is simply:

.run(data)

All functions that define a specific module are set up in its core class.
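One way such a core class could look is an abstract base class that forces every data preparation module to implement `.run(data)`. This is a sketch under assumptions: the class names `DataPrepCore`, `MinMaxScaler`, and `Incomplete` are invented for this example, and the toolkit's real core classes may be organized differently.

```python
from abc import ABC, abstractmethod


class DataPrepCore(ABC):
    """Hypothetical core class: every data preparation module must implement run()."""

    @abstractmethod
    def run(self, data):
        """Take a data set and return the prepared data set."""


class MinMaxScaler(DataPrepCore):
    """Example module: scales a list of numbers to the range [0, 1]."""

    def run(self, data):
        lo, hi = min(data), max(data)
        span = (hi - lo) or 1.0  # avoid division by zero for constant data
        return [(x - lo) / span for x in data]


class Incomplete(DataPrepCore):
    """A module that forgot to implement run()."""


scaled = MinMaxScaler().run([5, 10, 15])
print(scaled)

# Python refuses to instantiate a module that lacks the agreed core function.
try:
    Incomplete()
    instantiated = True
except TypeError:
    instantiated = False
print("Incomplete module could be created:", instantiated)
```

With this arrangement a colleague only needs to know the core class to use any module derived from it.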

The Module Registry

Another important feature of the workflow is that every module is registered. This supports the idea of exchanging modules in an efficient manner. For example, suppose a module that you just designed is called:

my_numpy_scaling_module.py

and it is stored in the data preparation folder. The "classical" way of accessing this module and its functions would be:

from workflow.data_preparation.my_numpy_scaling_module import ClassName

This is fine, except that you have to update this line every time you want to use a different data preparation module, which becomes annoying really fast. Instead, your module will be registered under a name, for example "MyNumpyScaler_v0". Using this name, the above line becomes:

import workflow.data_preparation as preparation

# Load a specific data preparation module:
prep = preparation.make("MyNumpyScaler_v0")
# Run the data preparation on the data set:
prepped_data = prep.run(data)
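How registration works internally is not described on this page. A common pattern, sketched here purely as an assumption, is a dictionary that maps names to classes together with a `make` function that looks a name up and instantiates the class. The `register` decorator, the `_REGISTRY` dictionary, and the second module name "MyShiftScaler_v0" are hypothetical.

```python
# Hypothetical registry: maps a registered name to a module class.
_REGISTRY = {}


def register(name):
    """Class decorator that stores the class under the given name."""
    def decorator(cls):
        if name in _REGISTRY:
            raise ValueError(f"Module name already registered: {name}")
        _REGISTRY[name] = cls
        return cls
    return decorator


def make(name, **kwargs):
    """Create an instance of the module registered under `name`."""
    if name not in _REGISTRY:
        raise KeyError(f"Unknown module {name!r}. Registered: {sorted(_REGISTRY)}")
    return _REGISTRY[name](**kwargs)


@register("MyNumpyScaler_v0")
class MyScaler:
    """Scales a list of numbers to the range [0, 1] (plain lists for simplicity)."""

    def run(self, data):
        lo, hi = min(data), max(data)
        span = (hi - lo) or 1.0  # avoid division by zero for constant data
        return [(x - lo) / span for x in data]


@register("MyShiftScaler_v0")
class MyShiftScaler:
    """Shifts the data so that its minimum is 0."""

    def run(self, data):
        lo = min(data)
        return [x - lo for x in data]


# Switching modules only requires changing the name string.
for module_name in ("MyNumpyScaler_v0", "MyShiftScaler_v0"):
    prep = make(module_name)
    print(module_name, prep.run([1.0, 3.0, 5.0]))
```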

Unit Tests

These are your best friends! Unit tests ensure that the logic of your module is solid, i.e. your module does what it is supposed to do. Besides that, unit tests make debugging easier while you are developing your module. Tip: Code your module and its unit test in parallel. This saves time, because you debug your module during its development stage.
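A unit test for a data preparation module might look like the following sketch, which uses Python's built-in `unittest` package. The `MinMaxScaler` class is a hypothetical module written for this example, and the test layout is not necessarily the one used in the repository.

```python
import unittest


class MinMaxScaler:
    """Hypothetical data preparation module under test."""

    def run(self, data):
        lo, hi = min(data), max(data)
        span = (hi - lo) or 1.0  # avoid division by zero for constant data
        return [(x - lo) / span for x in data]


class TestMinMaxScaler(unittest.TestCase):
    def setUp(self):
        self.prep = MinMaxScaler()

    def test_output_is_in_unit_interval(self):
        out = self.prep.run([-3.0, 0.0, 7.0])
        self.assertEqual(min(out), 0.0)
        self.assertEqual(max(out), 1.0)

    def test_constant_input_does_not_crash(self):
        self.assertEqual(self.prep.run([2.0, 2.0]), [0.0, 0.0])

    def test_length_and_order_are_preserved(self):
        out = self.prep.run([1.0, 2.0, 3.0])
        self.assertEqual(len(out), 3)
        self.assertEqual(out, sorted(out))


# Run the tests explicitly so the example also works inside a notebook or script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinMaxScaler)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Writing the constant-input test alongside the module is what exposes the division-by-zero case early, before the module is handed to anyone else.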
