This is a project on predicting Data Science and STEM salaries, done as part of the Machine Learning Engineering course held by Alexey Grigorev. Credits to Jack Ogozaly for scraping the dataset from levels.fyi! If you want to see the original dataset on Kaggle, look at it here: Dataset
Data Science and STEM salaries vary widely depending on factors such as the location one is based in and the company one works for. Moreover, this information is not easy to find on the internet, with some platforms gatekeeping salary data. This model therefore takes these variables and conveniently predicts the salary you can expect to receive.
The data folder consists of the original dataset, as well as the cleaned dataset.
The code folder consists of:
- Data Cleaning & EDA - part of the notebook
- Training of model & hyperparameter tuning (with and without cell output)* - part of the notebook
- train.py - to train the final model and save it using pickle (see the sketch after this list)
- predict.py - to load the model and serve it via a web service
- predict-test.py and predict-test-cloud.py - to test the model's output locally or in the cloud, depending on where you deployed it
- Pipfile and Pipfile.lock for the virtual environment using pipenv
- Dockerfile for building a Docker container
as well as some files that log the training process (catboost_info), some files for deploying the model to the cloud using AWS Elastic Beanstalk, and some notebook checkpoints.
* = the second notebook had problems loading on GitHub, partly because of the output it contains. Therefore, I have decided to include both copies. Do note that the notebook with the output WILL lag, so download it at your own discretion. It is, however, still there.
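For reference, here is a minimal sketch of what the final step of train.py looks like, assuming the trained CatBoost regressor is held in a variable called `model`. The file name `catboostreg.bin` matches the one copied in the Dockerfile below, but the training data here is a stand-in:

```python
import pickle

from catboost import CatBoostRegressor

# Stand-in features and target; the real train.py builds these
# from the cleaned dataset produced in the EDA notebook
X_train = [[1, 0, 3], [0, 1, 5], [1, 1, 8]]
y_train = [100000, 150000, 250000]

model = CatBoostRegressor(verbose=False)
model.fit(X_train, y_train)

# Save the trained regressor so predict.py can load and serve it
with open('catboostreg.bin', 'wb') as f_out:
    pickle.dump(model, f_out)
```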
I am currently using Windows, so I am using waitress to deploy the model. To deploy this model with waitress, run:

```bash
waitress-serve --listen=0.0.0.0:9696 predict:app
```
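For context, `predict:app` refers to the Flask app object inside predict.py. Here is a minimal sketch of that file, assuming a `/predict` endpoint that takes one candidate's features as JSON; the exact feature handling in the real script may differ:

```python
import pickle

from flask import Flask, request, jsonify

# Load the pickled CatBoost regressor saved by train.py
with open('catboostreg.bin', 'rb') as f_in:
    model = pickle.load(f_in)

app = Flask('salary-prediction')

@app.route('/predict', methods=['POST'])
def predict():
    candidate = request.get_json()
    # The real predict.py transforms the raw fields into model features;
    # here we assume the JSON values are already model-ready
    prediction = model.predict([list(candidate.values())])[0]
    return jsonify({'salary': float(prediction)})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)
```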
I used pipenv for the virtual environment. To use the same venv as me, first run `pip install pipenv`.
To replicate the environment, run the following on your command line:

```bash
pipenv install numpy scikit-learn==0.24.2 catboost flask gunicorn waitress
```
Do note that catboost takes longer than the other modules to install (for reference, catboost took about a minute or two on my computer, while the rest took about 10 seconds).
You can activate the environment using `pipenv shell`, and deploy the model as normal.
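Once the service is up, you can hit it with a small script in the spirit of predict-test.py. This is a sketch with hypothetical feature names; the real script uses the fields from the cleaned dataset:

```python
import requests

host = 'localhost:9696'
url = f'http://{host}/predict'

# Hypothetical example features; adjust to the fields the model expects
candidate = {
    'company': 'Google',
    'title': 'Data Scientist',
    'location': 'Seattle, WA',
    'yearsofexperience': 3,
}

print(requests.post(url, json=candidate).json())
```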
To deploy the model, refer to the "Deployment of model" part of the README.
I have built the Docker image and pushed it to kwangyy/salary-prediction:3.8.12-slim for easy use!
To use the image I built, just replace

```dockerfile
FROM python:3.8.12-slim
```

with

```dockerfile
FROM kwangyy/salary-prediction:3.8.12-slim
```

in the Dockerfile.
If you choose to build the Docker image locally instead, here are the steps to do so:

- In your command line, run:

```bash
docker run -it --rm --entrypoint=bash python:3.8.12-slim
```

This pulls the base python image and drops you into a shell inside the container.

- Create a Dockerfile as such:
```dockerfile
FROM python:3.8.12-slim
RUN pip install pipenv
WORKDIR /app
COPY ["Pipfile", "Pipfile.lock", "./"]
RUN pipenv install --system --deploy
COPY ["predict.py", "catboostreg.bin", "./"]
EXPOSE 9696
ENTRYPOINT ["waitress-serve", "--listen=0.0.0.0:9696", "predict:app"]
```
This takes the base python image, installs pipenv and the project dependencies, copies our predict script and the model itself, and deploys the model using waitress. Alternatively, you can just use the Dockerfile in this repository.
- Build the Docker image with

```bash
docker build -t salary-prediction .
```
- Run the Docker container with

```bash
docker run -it -p 9696:9696 salary-prediction:latest
```

so that we can use our model!
EDIT: The instance has been terminated since I do not want to incur additional costs. However, the instructions are still here for you to refer to, should you wish to replicate the AWS Elastic Beanstalk deployment.
To deploy this into the cloud using AWS Elastic Beanstalk:

- Use

```bash
pipenv install --dev awsebcli
```

to install the command line interface for AWS Elastic Beanstalk. It is installed as a dev dependency because we only need Elastic Beanstalk for deploying to the cloud, not for the model itself.

- Use

```bash
eb local run --port 9696
```

to let EB build and run the Docker container locally.

- Use

```bash
eb create salary-serving-env
```

to create the environment for the container itself. AWS will start creating the environment, so do give it a few minutes. Once it is done, there will be a line that says 'Application available at ....'. Copy that link - it is your new host.
- If you happen to use a .py file for a request, change your host to that link and your url to an f-string. e.g. if your host = `kwangyy.importantletters.us-east-2.elasticbeanstalk.com`, then your url = `f'http://{host}/predict'`.
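Putting that together, a request to the cloud instance would look something like this sketch (the host is the placeholder from above, and the feature names are hypothetical):

```python
import requests

# Replace with your own 'Application available at ...' link from eb create
host = 'kwangyy.importantletters.us-east-2.elasticbeanstalk.com'
url = f'http://{host}/predict'

# Hypothetical example features, as in the local test script above
candidate = {'title': 'Data Scientist', 'yearsofexperience': 3}

print(requests.post(url, json=candidate).json())
```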
- To terminate, use

```bash
eb terminate salary-serving-env
```

so as not to waste your EC2 hours (you need to pay for your instance hours once they hit a certain limit, you know that, right?).