Cowait is a system for packaging a project with its dependencies into a Docker image, which can then be run as a container either on the local machine or on a Kubernetes cluster. It alleviates several problems in data engineering such as dependency management, reproducibility, version control and parallel computation. Cowait runs code as containerized tasks, and a task can start subtasks with parameters and return values. These subtasks run in parallel as separate containers, which enables parallel computation.
A Cowait notebook is essentially a Jupyter notebook running with a Cowait kernel. This enables the notebook to act as if it were a Cowait task, which means it can start new Cowait tasks in the background. The notebook can run either locally or in a Kubernetes cluster, and it works the same way in both cases.
One of the defining differences between Cowait notebooks and other cloud notebooks is access to the local file system. When you start a Cowait notebook from the command line, it automatically receives access to the current working directory through a networked file system set up in the background.
In this lab you will learn how to use Cowait Notebooks by creating a simple, yet realistic, project. The notebooks will run on a Kubernetes cluster, but all the project files will reside on your computer. The lab takes around 20 minutes.
Please complete the following steps before proceeding.
- git
- Docker
- Python 3.7
- Install Cowait:
$ pip3 install cowait
If you already have Cowait installed, make sure it is at least version 0.4.23.
- Clone the demo repository:
$ git clone https://github.com/backtick-se/cowait-notebook-eval
$ cd cowait-notebook-eval
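The minimum-version requirement above is easy to misjudge by eye, because version strings compare differently as text than as numbers ("0.4.9" sorts after "0.4.23" as text). The following is a minimal sketch of a numeric comparison, assuming purely numeric, dot-separated versions. `is_at_least` is a hypothetical helper written for this illustration and is not part of Cowait.

```python
def is_at_least(installed: str, minimum: str = "0.4.23") -> bool:
    """Return True if `installed` is the same as or newer than `minimum`.

    Assumption: versions are purely numeric and dot-separated, e.g. '0.4.23'.
    """
    def as_tuple(version: str):
        # '0.4.23' -> (0, 4, 23), so parts are compared as integers
        return tuple(int(part) for part in version.split("."))

    return as_tuple(installed) >= as_tuple(minimum)


print(is_at_least("0.4.23"))  # exactly the minimum version
print(is_at_least("0.4.9"))   # older than 0.4.23, even though "0.4.9" > "0.4.23" as text
```

The installed version string itself can be read from the output of `pip3 show cowait`.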
You will need an image registry to distribute your code to the cluster. The easiest way is to sign up for a free account on Docker Hub at https://hub.docker.com/signup
After signing up, ensure your Docker client is logged in:
$ docker login
Participants in the evaluation study should have received a kubeconfig.yaml file that can be used to access the evaluation cluster. If you are not participating in the evaluation, you will have to set up your own Cowait cluster. A traefik2 reverse proxy deployment is required.
Put the provided kubeconfig file in the current working directory. Then, set the KUBECONFIG environment variable:
$ export KUBECONFIG=$(pwd)/kubeconfig.yaml
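A common source of trouble later in the lab is a KUBECONFIG variable that is unset or points to a file that does not exist, for example because the export was run from a different directory. The sketch below shows one way to check this from Python. `resolve_kubeconfig` is a hypothetical helper, and it assumes that KUBECONFIG holds a single path, as in the export command above.

```python
import os
from pathlib import Path


def resolve_kubeconfig(env=None) -> Path:
    """Return the path named by KUBECONFIG, or raise if it is unset or missing.

    Assumption: KUBECONFIG contains exactly one path (no ':'-separated list).
    """
    env = os.environ if env is None else env
    value = env.get("KUBECONFIG")
    if not value:
        raise RuntimeError("KUBECONFIG is not set")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"KUBECONFIG points to a missing file: {path}")
    return path


if __name__ == "__main__":
    try:
        print("Using cluster config:", resolve_kubeconfig())
    except (RuntimeError, FileNotFoundError) as err:
        print("Problem:", err)
```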
The goal of part one is to create a notebook that computes a value we are interested in. Then, we turn the notebook into a Cowait task, so that it can be executed as a batch job.
- Open cowait.yml and update the image setting to <your dockerhub username>/cowait-notebook-eval. This configures the name of the container image that will contain all our code and dependencies.
- Create a requirements.txt file and add pandas to it.
- Build the container image, and push it to your registry:
$ cowait build --push
- Launch a Cowait Notebook using your newly created image:
$ cowait notebook --cluster demo
It might take a few minutes for the cluster to download the image. Once the task is running, a link will be displayed. Open it to access the notebook.
- Create a new notebook called volume.ipynb. Make sure to select the Cowait kernel.
- Download some data into a pandas DataFrame. The dataset contains every trade executed on the BitMEX cryptocurrency derivatives platform, divided into one file per day.
    import pandas
    date = '20210101'
    df = pandas.read_csv(f'https://s3-eu-west-1.amazonaws.com/public.bitmex.com/data/trade/{date}.csv.gz')
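- We want to compute the total US dollar value of Bitcoin contracts traded over the course of the day. Bitcoin Perpetual Futures contracts have the ticker symbol XBTUSD, and each contract is worth one US dollar. To do this, use pandas to find all the rows containing XBTUSD transactions, and sum the size column. Select the column with df['size'], because the attribute df.size is the DataFrame's element count rather than the column.
    volume = int(df[df.symbol == 'XBTUSD']['size'].sum())
    print(volume)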
- Parameterize the notebook by changing the date variable to an input parameter:
    date = cowait.input('date', '20210101')
In Cowait, inputs allow us to send arguments to tasks. Later, we can substitute the input value to execute the notebook code for any date we like. If no input is set, the default value 20210101 will be used.
- Return the total volume from the notebook using cowait.exit():
    cowait.exit(volume)
Similarly to inputs, tasks can also return outputs. Returning an output allows us to invoke the notebook and use the computed value elsewhere.
- Write a simple sanity test for the notebook that verifies the computation for a date with a known volume. Create a file called test_compute_volume.py with your favorite text editor:
    # test_compute_volume.py
    from cowait.tasks.notebook import NotebookRunner

    async def test_compute_volume():
        # Run the notebook as a task and wait for the value passed to cowait.exit()
        vol = await NotebookRunner(path='volume.ipynb', date='20210101')
        # Expected volume for this date; it must match what volume.ipynb printed
        assert vol == 2556420
The NotebookRunner task executes a notebook file and returns any value provided to cowait.exit().
- Open a new terminal in the same folder and run the test. Make sure it passes.
$ cowait test
Unlike the notebook, which runs on the cluster, the tests run in a Docker container on your computer.
- Now is a good time to save your progress. Since the files are available on your local machine, use your git client to create a commit.
$ git add .
$ git commit -m 'Volume notebook'
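The filter-and-sum step at the heart of volume.ipynb can be illustrated without pandas or a network connection. The sketch below uses only the standard library and a few invented sample rows. Only the `symbol` and `size` column names come from the lab text. The other columns, all values, and the `total_size` helper are assumptions made for this illustration.

```python
import csv
import io

# Hypothetical sample rows; the real daily files are far larger and the
# columns other than `symbol` and `size` are assumed here.
SAMPLE_CSV = """timestamp,symbol,side,size,price
2021-01-01D00:00:01,XBTUSD,Buy,100,28950.5
2021-01-01D00:00:02,ETHUSD,Sell,40,730.1
2021-01-01D00:00:03,XBTUSD,Sell,250,28951.0
"""


def total_size(csv_text: str, symbol: str) -> int:
    """Sum the `size` column over all rows whose `symbol` matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["size"]) for row in reader if row["symbol"] == symbol)


print(total_size(SAMPLE_CSV, "XBTUSD"))  # 100 + 250 = 350
```

When doing the same in pandas, keep in mind that `df.size` is a built-in attribute holding the number of elements in the DataFrame, so a column named `size` is best selected by name with `df['size']`.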
We now have a notebook for calculating the volume for one day. But what if we want to know the volume for several days? While we could create a loop and download each day in sequence, it would be much more efficient to do it all at once, in parallel.
- Create a new notebook with the Cowait kernel, and call it batch.ipynb.
- First, we will create two input parameters and build the range of dates that we are interested in.
    from helpers import daterange

    start = cowait.input('start', '20210101')
    end = cowait.input('end', '20210104')

    dates = [date for date in daterange(start, end)]
    dates
- Then, we can create a NotebookRunner for each date in the list. This will start four new tasks, each calculating the volume for one day. While these are running, the notebook can perform other calculations.
    from cowait.tasks.notebook import NotebookRunner

    subtasks = [NotebookRunner(path='volume.ipynb', date=date) for date in dates]
- To get the results of the calculations we need to wait for each task to finish:
    # just for reference, don't try to run this
    result1 = await task1
    result2 = await task2
Since we have a list of pending tasks, we can use cowait.join. Create a new cell with the following code:
    results = await cowait.join(subtasks)
- Finally, let's print the results:
    print(results)
- Write the results to a JSON file on your local machine.
    import json
    with open('result.json', 'w') as f:
        json.dump(results, f)
- Use the Run All Cells feature in the Run menu to try out the notebook. This will run a task for each day in the date range, in parallel, on the cluster.
- Now is a good time to save your progress.
$ git add .
$ git commit -m 'Volume batch notebook'
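The shape of batch.ipynb (build a date range, start one computation per date, then wait for all of them) can be tried out in plain Python. The sketch below is an analogy only. It uses `asyncio.gather` inside a single process, whereas Cowait starts each subtask as a separate container. The `daterange` function is a hypothetical stand-in for the repository's helper, assumed to include both end points because the range 20210101 to 20210104 yields four tasks. `fake_volume` returns made-up numbers.

```python
import asyncio
from datetime import datetime, timedelta


def daterange(start: str, end: str):
    """Yield every day from start to end (both inclusive) as 'YYYYMMDD' strings.

    Hypothetical stand-in for helpers.daterange; the real helper may differ.
    """
    day = datetime.strptime(start, "%Y%m%d")
    last = datetime.strptime(end, "%Y%m%d")
    while day <= last:
        yield day.strftime("%Y%m%d")
        day += timedelta(days=1)


async def fake_volume(date: str) -> int:
    """Placeholder for one subtask; returns a made-up number after a short wait."""
    await asyncio.sleep(0.01)
    return int(date[-2:])  # day of month, purely illustrative


async def main():
    dates = list(daterange("20210101", "20210104"))
    # Start every computation first, then wait for all of them together.
    pending = [asyncio.ensure_future(fake_volume(d)) for d in dates]
    results = await asyncio.gather(*pending)  # results keep the order of `dates`
    return dates, results


dates, results = asyncio.run(main())
print(dates)    # ['20210101', '20210102', '20210103', '20210104']
print(results)  # [1, 2, 3, 4]
```

Starting all the work before awaiting any of it is what makes the computations overlap. Awaiting each one right after creating it would run them one after another.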
We now have a runnable notebook, and it is time to put it into production. We can run the batch notebook without Jupyter using the command line.
- Open a terminal in the same folder and make sure the KUBECONFIG environment variable is set:
$ export KUBECONFIG=$(pwd)/kubeconfig.yaml
- Before we can run tasks on the cluster we have to push an updated container image to a Docker registry. This image bundles all the code you've written along with the dependencies required to run it. Because code and dependencies are frozen together in the image, it will keep running as written for as long as the image is available.
$ cowait build --push
- The notebook can now be executed on the cluster as a batch job for a range of dates.
$ cowait notebook run batch.ipynb \
    --cluster demo \
    --input start=20210201 \
    --input end=20210207
- Who are you, and where do you work? What is your role?
- Briefly describe your overall impression of working with Cowait notebooks. Any questions? Any difficulties?
- What solutions do you currently use for notebooks/cloud compute?
- Do you currently experience any problems working with cloud notebooks in your organization?
  - File management
  - Version control
  - Lack of tooling (linters, testing, etc.)
- Do you think cloud notebooks with a shared file system could help improve the data science workflow in your organization? Why/why not?
- Do you see any other advantages in having access to your local file system when working in a cloud notebook? Any drawbacks?