This notebook demonstrates how to compile and execute an end-to-end machine learning workflow that uses Katib, TFJob, KServe, and a Tekton pipeline. It originated from the Kubeflow Pipelines e2e-mnist example running on Kubeflow 1.3.1, and has been modified to run with Tekton support.
The pipeline contains five steps: it finds the best hyperparameters using Katib, creates a PVC for storing the model, processes the hyperparameter results, trains the model in a distributed fashion on TFJob with the best hyperparameters over more iterations, and finally serves the model using KServe. See the accompanying Medium blog post for more details on this pipeline.
To run this pipeline, make sure your cluster has at least 16 CPUs and 32 GB of memory in total; otherwise some jobs may not be able to run, because the TFJob needs to run four TensorFlow pods in parallel for distributed training.
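For orientation, here is a minimal sketch of how five steps like these can be wired together and compiled for Tekton with the KFP v1 SDK. All op names, images, and messages below are hypothetical placeholders, not the notebook's actual components:

```python
import kfp.dsl as dsl
from kfp_tekton.compiler import TektonCompiler


def step(name: str, message: str) -> dsl.ContainerOp:
    """Placeholder op standing in for the real Katib/PVC/TFJob/KServe steps."""
    return dsl.ContainerOp(
        name=name,
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=["echo '%s'" % message],
    )


@dsl.pipeline(name="e2e-mnist-skeleton",
              description="Shape of the five-step MNIST pipeline")
def mnist_pipeline():
    katib = step("katib-experiment", "find best hyperparameters with Katib")
    pvc = step("create-pvc", "create a ReadWriteMany PVC for the model")
    convert = step("convert-results", "extract the best hyperparameters").after(katib)
    train = step("tfjob-training", "distributed training on TFJob").after(convert, pvc)
    step("kserve-inference", "deploy the model with KServe").after(train)


if __name__ == "__main__":
    # Produces a Tekton PipelineRun YAML that can be uploaded to KFP-Tekton.
    TektonCompiler().compile(mnist_pipeline, "e2e-mnist-skeleton.yaml")
```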
- Install KFP Tekton prerequisites
- Install the necessary Python packages for running the Jupyter notebook:

  ```shell
  pip install jupyter numpy Pillow
  pip install kubeflow-katib==0.12.0rc0
  ```
- Make sure the Kubernetes cluster has a StorageClass that supports ReadWriteMany, which is required for distributed training (see the sketch after this list for a quick way to inspect your StorageClasses).
- When running KFP in single-user mode, grant cluster-admin to the KFP service account so the pipeline can run Katib and KServe in any namespace:

  ```shell
  kubectl create clusterrolebinding pipeline-runner-extend --clusterrole cluster-admin --serviceaccount=kubeflow:pipeline-runner
  ```
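To see which StorageClasses your cluster offers, here is a short sketch using the Kubernetes Python client, assuming a working kubeconfig. Note that whether a class supports ReadWriteMany depends on the provisioner backing it (for example, NFS-based provisioners typically do):

```python
# List StorageClasses and their provisioners; check your provisioner's
# documentation to confirm it supports the ReadWriteMany access mode.
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig
for sc in client.StorageV1Api().list_storage_class().items:
    print(sc.metadata.name, "->", sc.provisioner)
```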
Once you have completed all the prerequisites for this example, you can start the Jupyter server in this directory and open the mnist.ipynb notebook. The notebook has step-by-step instructions for running the KFP Tekton pipeline.

```shell
python -m jupyter notebook
```
Or, you can compile the pipeline directly with:

```shell
python e2e-mnist.py
```
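If you prefer to submit the compiled pipeline programmatically instead of through the UI, a minimal sketch with the kfp_tekton client might look like the following. Both the endpoint URL and the compiled YAML filename are assumptions, so adjust them for your cluster:

```python
# Hypothetical endpoint and filename; replace with your KFP-Tekton API
# address and the YAML file produced by compiling the pipeline.
from kfp_tekton import TektonClient

client = TektonClient(host="http://localhost:8888")
client.create_run_from_pipeline_package("mnist.yaml", arguments={})
```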
Thanks to Hougang Liu and Andrey Velichkevich for creating the original e2e-mnist notebook.