What is CML? Continuous Machine Learning (CML) is an open-source library for implementing continuous integration & delivery (CI/CD) in machine learning projects. Use it to automate parts of your development workflow, including model training and evaluation, comparing ML experiments across your project history, and monitoring changing datasets.
For this project, I used the Red Wine Quality Dataset from Kaggle. It is a simple, clean practice dataset for regression and classification modelling. With 1,599 rows and 12 columns it is relatively small, but good enough to understand the purpose of this project. On a side note, I used the Rainbow CSV extension for VS Code to make .csv files look more attractive 😅.
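As a rough illustration of how the dataset splits into features and a target, here is a minimal sketch using only the standard library. The header row reflects the column names of the Kaggle file as I understand them, and the two data rows are made-up placeholders, not real records. `load_wine` is a hypothetical helper, not part of this project's code.

```python
import csv
import io

# Illustrative sample only: the header is assumed to match the Kaggle CSV,
# and the two data rows are placeholders, not real records.
SAMPLE_CSV = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.0,0.5,0.1,2.0,0.08,10,30,0.997,3.4,0.6,9.5,5
8.0,0.3,0.4,2.5,0.07,15,40,0.996,3.2,0.7,11.0,7
"""


def load_wine(text, target="quality"):
    """Parse CSV text into (feature_names, X, y)."""
    reader = csv.DictReader(io.StringIO(text))
    # Every column except the target is treated as a numeric feature.
    feature_names = [name for name in reader.fieldnames if name != target]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[target]))
    return feature_names, X, y


feature_names, X, y = load_wine(SAMPLE_CSV)
print(len(feature_names), "features,", len(X), "rows")
```

With the real file you would pass its contents instead of `SAMPLE_CSV`. The 11 feature columns plus `quality` account for the 12 columns mentioned above.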
To use GitHub Actions, you need to create a special .yaml/.yml file in the .github/workflows/ directory. This file defines the workflow that runs when a particular trigger occurs. In this case, a [push] to the repository (irrespective of the branch) triggers the workflow. Detailed code of the workflow, with comments, here
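To give a feel for what such a file can contain, here is a minimal sketch that writes a push-triggered workflow into the expected directory. This is not the project's actual workflow: the container image tag, `train.py`, `requirements.txt`, `report.md`, the file name `cml.yaml`, and the helper `write_workflow` are all assumptions for illustration, so check the CML documentation for the image and command names that match your CML version.

```python
from pathlib import Path

# Hypothetical workflow: the image tag, train.py, requirements.txt and
# report.md are assumptions, not the author's actual configuration.
WORKFLOW_YAML = """\
name: model-training
on: [push]  # any push, on any branch, triggers the workflow
jobs:
  train-and-report:
    runs-on: ubuntu-latest
    container: ghcr.io/iterative/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v3
      - name: Train model and post report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install -r requirements.txt
          python train.py
          cml comment create report.md
"""


def write_workflow(repo_root, filename="cml.yaml"):
    """Write the workflow under <repo_root>/.github/workflows/ and return its path."""
    workflow_dir = Path(repo_root) / ".github" / "workflows"
    workflow_dir.mkdir(parents=True, exist_ok=True)
    path = workflow_dir / filename
    path.write_text(WORKFLOW_YAML, encoding="utf-8")
    return path


if __name__ == "__main__":
    print(write_workflow("."))
```

In practice you would usually write the YAML file by hand. Generating it from a script is only a convenient way to show the directory layout and the `on: [push]` trigger together.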
An idea that often goes hand in hand with continuous integration is GitFlow. Whenever we want to experiment with the project, for example by adding something or changing parameters, we create a new branch in Git and do the development on that branch. Once we are happy with the result, we can merge it into the main branch of the project.
Propose the new changes in a new branch or an existing branch (except main), then open a pull request from that branch. The runner runs the workflow when a trigger is detected. A runner is a server that has the GitHub Actions runner application installed. You can use a runner hosted by GitHub, or you can host your own. ... For GitHub-hosted runners, each job in a workflow runs in a fresh virtual environment. GitHub-hosted runners are based on Ubuntu Linux, Microsoft Windows, and macOS.
For each job, the runner first sets up the job, initializes the container (the CML Docker image in this case), runs the checkout action (GitHub Actions checkout), runs the workflow we defined in the .yaml file, runs the checkout action's post step, stops the container, and finally completes the job. All these steps are triggered on [push] (pushing a commit to the repo).
Whenever my collaborators propose changes to the code, they can open a pull request against the experiment branch or any other branch, and a bot will display the Model Metrics and Data Visuals for the proposed changes. In a team, it is cumbersome to run the training script locally every time someone changes the code; with GitHub Actions this whole process is automated. So whenever we change the code, we get fast feedback about what happened, in an aesthetically pleasing format that other collaborators can easily look at or revisit. You can open the report, go back to the commit that produced it, and see everything as it was at that point, even in a closed pull request. This creates links between the code, the data, the environment, the training infrastructure, and the results.
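The report the bot posts is just a Markdown file produced during the workflow run. Here is a minimal sketch of how a training script might assemble one. The metric names, their values, the plot file name, and the helper `build_report` are all hypothetical placeholders, and how images end up embedded in the comment depends on the CML version you use.

```python
from pathlib import Path


def build_report(metrics, plot_paths):
    """Return a Markdown report with a metrics table and image links."""
    lines = ["## Model Metrics", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.3f} |")
    lines += ["", "## Data Visuals", ""]
    for path in plot_paths:
        # Use the file name without its extension as the image alt text.
        lines.append(f"![{Path(path).stem}]({path})")
    return "\n".join(lines) + "\n"


# Placeholder numbers and file name, for illustration only.
report = build_report(
    {"train_accuracy": 0.912, "test_accuracy": 0.874},
    ["feature_importance.png"],
)
Path("report.md").write_text(report, encoding="utf-8")
```

The workflow can then hand `report.md` to CML's comment command, which is what makes the metrics and visuals appear on the commit or pull request.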
If you have any questions, feel free to click on the social icon you would like to connect with 🤗
If you liked my work and found it useful, you can buy me a coffee by clicking the button below 😊