Attrition Predictor

Image source & topic article

Dataset: IBM HR Analytics Employee Attrition & Performance

Visit the live project here - https://attrition-predictor-dbb87cf1fa29.herokuapp.com/

The attrition predictor predicts whether an employee will remain in the workforce based on multiple factors such as demographics and work culture. Attrition in this context covers both voluntary and involuntary departure from an organization for unpredictable or uncontrollable reasons. Managing and understanding attrition is pivotal for organizations to ensure a stable and engaged workforce. With a high attrition rate, a company is likely to shrink in size. Employee attrition leads to significant costs for a business, including the cost of business disruption and of hiring and training new staff. There is therefore great business interest in understanding the drivers of staff attrition and in minimizing it.

The main objective is to predict whether an employee is about to leave the company. This would allow human resources to intervene and, where possible, prevent it by changing the employee's conditions.

A statistical study and data analysis were carried out to understand how attrition (the target) is affected by the other variables (features). The problem was treated as a classification task.

How to use this repo

  1. Use this template to create your GitHub project repo
  2. Log into your cloud IDE with your GitHub account.
  3. On your Dashboard, click on the New Workspace button
  4. Paste in the URL you copied from GitHub earlier
  5. Click Create
  6. Wait for the workspace to open. This can take a few minutes.
  7. Open a new terminal and run pip3 install -r requirements.txt
  8. Open the jupyter_notebooks directory, and click on the notebook you want to open.
  9. Click the kernel button and choose Python Environments.

Note that the kernel reports Python 3.8.18, since it inherits the interpreter installed by the workspace template. To confirm this, you can run ! python --version in a notebook code cell.

Dataset Content

  • Important data disclaimer: This dataset was generated by IBM scientists and is made up of fictional data. The ML algorithm/predictor was built solely for learning purposes and shall not be used for drawing any real conclusions. As discussed here, the dataset is a snapshot, i.e. it lacks time-series data, so the predictions reflect current conditions rather than future events.

  • The dataset can be found on Kaggle and consists of 1470 rows and 35 columns, i.e. a total of 51450 data points. 9 columns are categorical of Object (string) type, while the remaining 26 columns are numerical of integer type. The following summary was obtained with ProfileReport from the ydata_profiling library.
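
A minimal sketch of how such a summary can be generated, assuming the raw CSV has been downloaded from Kaggle (the file name and path are assumptions and may differ from this repo's layout):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Kaggle dataset (hypothetical path)
df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Exploratory profile: variable types, distributions, missing values, correlations
profile = ProfileReport(df, title="Attrition dataset profile", minimal=True)
profile.to_notebook_iframe()  # render inline in a Jupyter notebook
# profile.to_file("outputs/eda/attrition_profile.html")  # or export to HTML
```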

The numerical columns (26 columns)

In 7 of the 26 numerical columns, the integers encode ordinal categories, so they are discrete. They are explained as follows:

| Attribute | Information | Units |
| --- | --- | --- |
| Education | Level of education of the employee | 1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor |
| Environment Satisfaction | Level of employee satisfaction with the workplace environment | 1: Low, 2: Medium, 3: High, 4: Very High |
| Job Involvement | How engaged the employee is in the workplace | 1: Low, 2: Medium, 3: High, 4: Very High |
| Job Satisfaction | Level of employee satisfaction with the job | 1: Low, 2: Medium, 3: High, 4: Very High |
| Performance Rating | Rating of employee performance | 1: Low, 2: Good, 3: Excellent, 4: Outstanding |
| Relationship Satisfaction | Level of employee satisfaction with workplace relationships | 1: Low, 2: Medium, 3: High, 4: Very High |
| Work-Life Balance | Level of employee work-life balance | 1: Bad, 2: Good, 3: Better, 4: Best |

For the remaining 19 numerical columns, the data summary is as follows:

| Attribute | Information | Range |
| --- | --- | --- |
| Age | Employee age | 18 - 60 |
| Daily Rate* | Employee daily rate | 102 - 1499 |
| Distance from Home | Distance from home to the workplace | 1 - 29 |
| Employee Count* | Employee count per record (constant) | 1 |
| Employee Number* | Employee ID | 1 - 2068 |
| Hourly Rate* | Employee hourly rate | 30 - 100 |
| Job Level | Employee job level (hierarchical) | 1 - 5 |
| Monthly Income | Employee monthly income | 1009 - 19999 |
| Monthly Rate* | Monthly rate | 2094 - 26999 |
| Num Companies Worked | Number of companies worked at | 0 - 9 |
| Percent Salary Hike | Percentage salary increase between two consecutive years | 11 - 25 |
| Standard Hours* | Standard working hours per week (constant) | 80 |
| Stock Options Level | Employee stock option level | 0 - 3 |
| Total Working Years | Total years worked | 0 - 40 |
| Training Times Last Year | Number of trainings the employee attended last year | 0 - 6 |
| Years At Company | Number of years the employee has stayed at the company | 0 - 40 |
| Years In Current Role | Number of years the employee has been in the current role | 0 - 18 |
| Years Since Last Promotion | Years since last promotion | 0 - 15 |
| Years With Curr Manager | Years spent with the current manager | 0 - 17 |

The categorical columns (9 columns)

| Attribute | Information | Values |
| --- | --- | --- |
| Attrition | Has the employee left? (Target) | ['Yes', 'No'] |
| Business Travel | Frequency of employee business travel | ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'] |
| Department | Employee department | ['Sales', 'Research & Development', 'Human Resources'] |
| Education Field | Employee education field | ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources'] |
| Gender | Employee gender | ['Male', 'Female'] |
| Job Role | Employee role | ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources'] |
| Marital Status | Employee marital status | ['Single', 'Married', 'Divorced'] |
| Over18* | Is the employee over 18? | ['Y'] |
| OverTime | Did the employee work overtime? | ['Yes', 'No'] |

The columns marked with an asterisk (*) are dropped, as they either contain only one value (Employee Count, Over18, Standard Hours) or are considered not to affect the target (Employee Number). Additionally, the hourly, daily and monthly rates were dropped as they are ambiguous (also mentioned in this discussion); the focus was on the monthly income instead.
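
As a sketch, the drops described above could look as follows in pandas (column names follow the Kaggle dataset's CamelCase spelling; treat the exact list and path as assumptions):

```python
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")

cols_to_drop = [
    "EmployeeCount", "Over18", "StandardHours",  # single-valued columns
    "EmployeeNumber",                            # identifier, not a driver of attrition
    "HourlyRate", "DailyRate", "MonthlyRate",    # ambiguous rates; MonthlyIncome is kept
]
df = df.drop(columns=cols_to_drop)
```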

Business Requirements

The client (a company's HR department) requested an ML algorithm from a data practitioner to predict attrition based on the dataset provided. The aim is to minimize attrition rates.

  • Business Requirement 1 - The client is interested in understanding the main factors leading to attrition
  • Business Requirement 2 - The client is interested in predicting whether a certain employee will decide to leave the company

Hypothesis and how to validate?

  • We suspect that monthly income plays a big role in attrition
    • Validation: correlation analysis and plots (see the sketch after this list)
  • We suspect that men tend to leave the workforce more often than women
    • Validation: correlation analysis and plots
  • We suspect that only a few features affect attrition
    • Validation: feature selection in the ML pipeline
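
A minimal sketch of how the first two hypotheses could be checked with plots and a simple correlation measure (the actual notebook analysis may use different methods):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df["AttritionFlag"] = (df["Attrition"] == "Yes").astype(int)

# Hypothesis 1: monthly income vs attrition
print(df[["MonthlyIncome", "AttritionFlag"]].corr(method="spearman"))
sns.boxplot(data=df, x="Attrition", y="MonthlyIncome")
plt.show()

# Hypothesis 2: attrition rate by gender
print(df.groupby("Gender")["AttritionFlag"].mean())
sns.countplot(data=df, x="Gender", hue="Attrition")
plt.show()
```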

The rationale to map the business requirements to the Data Visualizations and ML tasks

  • Business Requirement 1: Data Visualization and Correlation study
    • We will inspect the dataset and look for patterns
    • We will plot the main features against attrition to visualize insights
    • We will do a feature importance analysis
  • Business Requirement 2: Classification analysis
    • We want to predict whether an employee will leave or not, so we will build a binary classifier.
    • We will build a pipeline for the ML predictor algorithm (a sketch follows this list)
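
A minimal sketch of such a pipeline, using scikit-learn, Feature-engine and XGBoost as listed under Technologies Used (the actual steps, hyperparameters and any imbalance handling in the notebooks may differ):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from feature_engine.encoding import OrdinalEncoder
from xgboost import XGBClassifier

df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")
X = df.drop(columns=["Attrition"])
y = (df["Attrition"] == "Yes").astype(int)  # 0 = no attrition, 1 = attrition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipeline = Pipeline([
    # Encode categorical (object) features as integers
    ("encoder", OrdinalEncoder(encoding_method="arbitrary")),
    # Keep only the features the model itself finds informative
    ("feat_selection", SelectFromModel(XGBClassifier(random_state=0))),
    # Binary classifier
    ("model", XGBClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)
```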

ML Business Case

Predict Attrition

Classification Model

  • We want an ML model to predict whether an employee will leave the company. The target variable is categorical with 2 classes, so we consider a supervised, 2-class, single-label classification model whose output is 0 (no attrition) or 1 (attrition).

  • The ideal outcome is to provide the HR team with a prediction of whether an employee is leaving the company.

  • The model success metrics are

    • At least 80% precision for "no attrition" on the train and test sets, because we want to be confident that an employee predicted to stay is indeed not intending to leave the company.
    • At least 60% precision for "attrition" on the train and test sets; because we want to detect attrition as early as possible, we accept a lower precision for this class in order to take action in time (see the evaluation sketch below).
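
As a sketch of how these thresholds could be checked, reusing the fitted pipeline and the train/test splits from the sketch above (names are assumptions):

```python
from sklearn.metrics import classification_report, confusion_matrix

for name, X_set, y_set in [("train", X_train, y_train), ("test", X_test, y_test)]:
    y_pred = pipeline.predict(X_set)
    print(f"--- {name} set ---")
    # Per-class precision: "no attrition" should reach >= 0.80, "attrition" >= 0.60
    print(classification_report(y_set, y_pred, target_names=["no attrition", "attrition"]))
    print(confusion_matrix(y_set, y_pred))
```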

Dashboard Design

Page 1: Project Summary

  • Introduction to the project and motivation
  • Project dataset description
  • Display the first ten rows of the data
  • State business requirements

Page 2: Project Hypothesis and Validation

  • We report on whether the 3 hypotheses we posed earlier are correct
  • Checkbox to display the corresponding plot for each hypothesis

Page 3: Attrition Correlation Analysis

  • State business requirement 1
  • Correlation analysis
  • Checkbox: display the most correlated variables to attrition
  • Checkbox: display the predictive power score (PPS) heatmap (a heatmap sketch follows this list)
  • Strongest correlation features to attrition (numerical and categorical)
  • Selectbox: to select individual plots showing the attrition levels for each correlated variable
  • Conclusions
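
A minimal sketch of how the PPS heatmap could be produced with the ppscore package (plot styling is an assumption):

```python
import pandas as pd
import ppscore as pps
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv")

# pps.matrix returns one row per (x, y) pair with its predictive power score
pps_matrix = (
    pps.matrix(df)[["x", "y", "ppscore"]]
    .pivot(columns="x", index="y", values="ppscore")
)

plt.figure(figsize=(14, 12))
sns.heatmap(pps_matrix, annot=True, fmt=".2f", cmap="viridis", linewidths=0.5)
plt.show()
```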

Page 4: Attrition Predictor

  • State business requirement 2
  • Set of widget inputs
  • "Run predictive analysis" button that serves the prospect data to our ML pipelines and predicts if an employee will leave or not.

Page 5: Model Performance

  • Model success metrics
  • Present ML pipeline steps
  • Feature importance (as a list and a bar plot figure); this relates to business requirement 1 (see the sketch after this list)
  • Pipeline performance, classification report and confusion matrix
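
As a sketch of how the feature importance list and bar plot could be derived from the fitted pipeline in the earlier sketch (step names follow that sketch and are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Columns that survived the selection step, and the final model's importances
selector = pipeline.named_steps["feat_selection"]
encoded_cols = pipeline.named_steps["encoder"].transform(X_train).columns
selected_cols = encoded_cols[selector.get_support()]

importances = pd.Series(
    pipeline.named_steps["model"].feature_importances_, index=selected_cols
).sort_values(ascending=False)

print(importances)  # list form
importances.plot(kind="bar", title="Feature importance")  # bar plot form
plt.show()
```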

Page 6: Project Conclusions

  • Considerations and conclusions after the pipeline is trained
  • Project outcomes

Technologies Used

The technologies used throughout the development are listed below:

Languages

  • Python

Python Packages

  • Pandas - Data analysis and manipulation tool
  • Numpy - The fundamental package for scientific computing with Python
  • YData Profiling - For data profiling and exploratory data analysis
  • Matplotlib - Comprehensive library for creating static, animated and interactive visualisations
  • Seaborn - Data visualisation library for drawing attractive and informative statistical graphics
  • Pingouin - Statistical package for simple yet exhaustive stats functions
  • Feature-engine - Transformers to engineer and select features for machine learning models
  • ppscore - Data-type-agnostic score that can detect linear or non-linear relationships between two columns
  • scikit-learn - Machine learning library for training the ML model
  • XGBoost - Optimised distributed gradient boosting library
  • Imbalanced-learn - Tool for dealing with classification problems with imbalanced target
  • Joblib - Tool for dumping pipeline to pickle files
  • Kaggle - Kaggle API functionality
  • Streamlit - Build the web app.

Other Technologies

  • Git & GitHub - Version control and repository hosting
  • Heroku - Cloud platform used to deploy the live app
  • Jupyter Notebook - Environment for the data analysis and ML notebooks

Unfixed Bugs

  • There are no unfixed bugs, except that Jupyter notebooks sometimes do not render plots when the Run All button is pressed.
  • Sometimes StreamlitAPIException: set_page_config() can only be called once per app page is raised; after a refresh of the web page the Streamlit app works fine again. The source of the issue is unknown (see the note below).
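
A likely fix (an assumption, not verified against this repo) is to make sure st.set_page_config is called exactly once, as the first Streamlit command in the app's entry script, rather than from individual page modules:

```python
# app.py - hypothetical entry point of the multipage app
import streamlit as st

# Must be the first Streamlit call in the whole app, and called only once
st.set_page_config(page_title="Attrition Predictor")

# ...build the page menu and render the selected dashboard page afterwards
```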

Deployment

Set Heroku stack

To log into the Heroku toolbelt CLI:

  1. Install the client in the terminal: curl https://cli-assets.heroku.com/install-ubuntu.sh | sh
  2. Log in to your Heroku account and go to Account Settings in the menu under your avatar.
  3. Scroll down to the API Key and click Reveal.
  4. Copy the key.
  5. In the terminal, run heroku login -i
  6. Enter your email and paste in your API key when asked.
  7. Set the stack to heroku-20: heroku stack:set heroku-20 --app attrition-predictor
  8. In this repo, set the runtime.txt Python version to python-3.8.19

The App live link is: https://attrition-predictor-dbb87cf1fa29.herokuapp.com/

Deployment steps

  1. Log in to Heroku and create an App
  2. At the Deploy tab, select GitHub as the deployment method.
  3. Select your repository name and click Search. Once it is found, click Connect.
  4. Select the branch you want to deploy, then click Deploy Branch.
  5. The deployment process should run smoothly if all deployment files are fully functional. Then click the Open App button at the top of the page to access your app.
  6. If the slug size is too large, add large files that are not required for the app to the .slugignore file.

Important configuration files

  • setup.sh should contain the following:

```bash
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
```

  • Procfile should contain:

```
web: sh setup.sh && streamlit run app.py
```

Credits

  • Helper functions and custom class snippets used in this project were provided by Code Institute. These are mainly adapted from the predictive analytics module.

Content

  • The idea originated from a search of the most-used datasets on Kaggle. The content is explained in the Dataset Content section above.

Acknowledgements (optional)

  • Thanks to my mentor, Mo Shami, for his support and guidance on the execution of the project. Thanks to Sean Tilson, my colleague at Code Institute, for the discussion on model performance.
