IBM HR Analytics Employee Attrition & Performance - GitHub
Visit the live project here - https://attrition-predictor-dbb87cf1fa29.herokuapp.com/
The attrition predictor predicts whether an employee will remain in the workforce based on multiple factors such as demographics and work culture. Attrition in this context covers both voluntary and involuntary departures from an organisation for unpredictable or uncontrollable reasons. Managing and understanding attrition is pivotal for organisations that want a stable and engaged workforce. With a high attrition rate, a company is likely to shrink in size. Employee attrition leads to significant costs for a business, including business disruption and the hiring and training of new staff. There is therefore great business interest in understanding the drivers of attrition and minimising it.
The main objective is to predict whether an employee is about to leave the company. This would allow human resources to intervene and, where possible, prevent it by changing the employee's conditions.
The statistical study and data analysis were carried out to understand how attrition (the target) is affected by the other variables (the features). The problem was treated as a classification task.
- How to use this repo
- Dataset Content
- Business Requirements
- Hypothesis
- Mapping Business Requirements to Data Visualisation and ML Tasks
- ML Business Case
- Dashboard Design
- Technologies Used
- Unfixed Bugs
- Deployment
- Credits
- Acknowledgements
- Use this template to create your GitHub project repo
- Log into your cloud IDE with your GitHub account.
- On your Dashboard, click on the New Workspace button
- Paste in the URL you copied from GitHub earlier
- Click Create
- Wait for the workspace to open. This can take a few minutes.
- Open a new terminal and run
pip3 install -r requirements.txt
- Open the jupyter_notebooks directory, and click on the notebook you want to open.
- Click the kernel button and choose Python Environments.
Note that the kernel says Python 3.8.18 as it inherits from the workspace, which was set up with Python 3.8.18 by our template. To confirm this, you can run ! python --version in a notebook code cell.
- Important data disclaimer: This dataset was generated by IBM scientists and is made up of fictional data. The ML algorithm/predictor was built solely for learning purposes and shall not be used for drawing any real conclusions. As discussed here, the dataset is a snapshot, i.e. it is missing time-series data, so the prediction reflects current conditions rather than future events.
The dataset can be found on Kaggle and consists of 1470 rows and 35 columns, i.e. a total of 51450 data points. 9 columns are categorical of Object (or string) type, while the rest (26 columns) are numerical of integer type. The following summary was obtained from ProfileReport, imported from the ydata_profiling library.
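The categorical/numerical split reported by ProfileReport can also be verified with plain pandas. A minimal sketch on a hypothetical miniature stand-in for the Kaggle CSV (the real file has 1470 rows and 35 columns):

```python
import pandas as pd

# Toy frame standing in for the Kaggle dataset; column names match it,
# but the data here is made up for illustration.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "R&D", "R&D"],
})

# Count columns per dtype, mirroring the 9-object / 26-integer split
# that ProfileReport reports on the full dataset.
categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.tolist()
print(len(categorical), len(numerical))  # → 2 2 on this toy frame
```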
In 7 of the 26 numerical columns, the integers encode an ordinal category, so they are discrete. They are explained as follows:
Attribute | Information | Units |
---|---|---|
Education | Level of education of the employee | 1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor |
Environment Satisfaction | Level of employee satisfaction in the workplace environment | 1: Low, 2: Medium, 3: High, 4: Very High |
Job Involvement | How engaged the employee is in the workplace | 1: Low, 2: Medium, 3: High, 4: Very High |
Job Satisfaction | Level of employee satisfaction from the job | 1: Low, 2: Medium, 3: High, 4: Very High |
Performance Rating | Rating of employee performance | 1: Low, 2: Good, 3: Excellent, 4: Outstanding |
Relationship Satisfaction | Level of employee satisfaction with workplace relationships | 1: Low, 2: Medium, 3: High, 4: Very High |
Work-Life Balance | Level of employee work-life balance | 1: Bad, 2: Good, 3: Better, 4: Best |
For the remaining 19 numerical columns, the data summary is as follows:
Attribute | Information | Range |
---|---|---|
Age | Employee age | 18 - 60 |
Daily Rate* | Employee daily rate | 102 - 1499 |
Distance from Home | Distance from home to workplace | 1 - 29 |
Employee Count* | Constant column (always 1 per employee record) | 1 |
Employee Number* | Employee ID | 1 - 2068 |
Hourly Rate* | Employee hourly rate | 30 - 100 |
Job level | Employee job level (hierarchical) | 1 - 5 |
Monthly Income | Employee monthly income | 1009 - 19999 |
Monthly Rate* | Monthly rate | 2094 - 26999 |
Num Companies Worked | Number of companies worked at | 0 - 9 |
Percent Salary Hike | Percentage salary increase between two consecutive years | 11 - 25 |
Standard Hours* | Standard working hours per week | 80 |
Stock Options Level | How many stock options the employee holds in the company | 0 - 3 |
Total Working Years | Total years worked | 0 - 40 |
Training Times Last Year | Trainings the employee had | 0 - 6 |
Years At Company | Number of years employee stayed at the company | 0 - 40 |
Years In Current Role | Number of years employee stayed in current role | 0 - 18 |
Years Since Last Promotion | Years since last promotion | 0 - 15 |
Years With Curr Manager | Years spent with current manager | 0 - 17 |
The 9 categorical columns are summarised as follows:
Attribute | Information | Units |
---|---|---|
Attrition | Employee left? (Target) | ['Yes', 'No'] |
Business Travel | Rate of employee business travels | ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'] |
Department | Employee department | ['Sales', 'Research & Development', 'Human Resources'] |
Education Field | Employee education field | ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources'] |
Gender | Employee gender | ['Male', 'Female'] |
Job Role | Employee role | ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources'] |
Marital Status | Employee marital status | ['Single', 'Married', 'Divorced'] |
Over18* | Is Employee over 18? | ['Y'] |
OverTime | Employee worked overtime? | ['Yes', 'No'] |
The columns marked with an asterisk (*) were dropped, as they either contain only one value (Employee Count, Over 18, Standard Hours) or are considered not to affect the target (Employee Number). Additionally, the hourly, daily and monthly rates were dropped as they are ambiguous (also mentioned in this discussion). The focus was on monthly income instead.
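The dropping step can be sketched with pandas. A minimal example on a hypothetical toy frame (column names match the dataset; the values are made up):

```python
import pandas as pd

# Toy frame standing in for the Kaggle dataset.
df = pd.DataFrame({
    "Age": [41, 49],
    "EmployeeCount": [1, 1],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
    "EmployeeNumber": [1, 2],
    "HourlyRate": [94, 61],
    "DailyRate": [1102, 279],
    "MonthlyRate": [19479, 24907],
    "MonthlyIncome": [5993, 5130],
})

# Single-value columns carry no signal, EmployeeNumber is just an ID,
# and the three rate columns are ambiguous.
to_drop = ["EmployeeCount", "Over18", "StandardHours", "EmployeeNumber",
           "HourlyRate", "DailyRate", "MonthlyRate"]
df = df.drop(columns=to_drop)
print(df.columns.tolist())  # → ['Age', 'MonthlyIncome']
```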
The client (a company's HR department) asked a data practitioner for an ML algorithm to predict attrition based on the provided dataset. The aim is to minimise attrition rates.
- Business Requirement 1 - The client is interested in understanding the main factors leading to attrition
- Business Requirement 2 - The client is interested in predicting whether a certain employee will decide to leave the company
- We suspect that monthly income plays a big role in attrition
- Correlation analysis and plots
- We suspect that men tend to leave the workforce more often than women
- Correlation analysis and plots
- We suspect that only a few features affect attrition
- Feature selection in the ML Pipeline
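The third hypothesis can be probed with a model-based feature selector. A minimal sketch using scikit-learn's SelectFromModel on synthetic data (the project's actual pipeline uses Feature-engine transformers; this is only an illustration of the idea):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first of five features actually drives the
# target, mirroring the hypothesis that few features affect attrition.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Keep features whose importance is above the mean importance.
selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
print(selector.get_support())  # typically True only for the informative column
```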
- Business Requirement 1: Data Visualization and Correlation study
- We will inspect the dataset and look for patterns
- We will plot the main features against attrition to visualize insights
- We will do a feature importance analysis
- Business Requirement 2: Classification analysis
- We want to predict if an employee will leave or not. We want to build a binary classifier.
- We want to build a pipeline for the ML predictor algorithm
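Such a binary-classification pipeline can be sketched with scikit-learn. This is a hypothetical simplification: the feature lists are illustrative, and the real pipeline uses Feature-engine transformers and an XGBoost estimator rather than the components shown here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature subset for illustration only.
numeric = ["Age", "MonthlyIncome"]
categorical = ["OverTime", "Department"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Toy training data: 1 = attrition, 0 = no attrition.
X = pd.DataFrame({
    "Age": [25, 45, 30, 50],
    "MonthlyIncome": [2000, 9000, 3000, 12000],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Department": ["Sales", "R&D", "Sales", "R&D"],
})
y = [1, 0, 1, 0]
pipeline.fit(X, y)
```

Once fitted, pipeline.predict(new_data) returns 0 or 1 for each prospect employee.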
- We want an ML model to predict whether an employee will leave the company. The target variable is categorical and contains two classes, so we consider a supervised, 2-class, single-label classification model whose output is 0 (no attrition) or 1 (attrition).
- The ideal outcome is to provide the HR team with a prediction of whether an employee is about to leave the company.
- The model success metrics are:
- At least 80% precision for no attrition, on the train and test sets, because we want to be confident that an employee predicted to stay does not intend to leave the company.
- At least 60% precision for attrition, on the train and test sets, because we want to detect attrition as early as possible; we accept lower precision here in order to catch more potential leavers and take action.
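These per-class thresholds can be checked with scikit-learn's precision_score. A minimal sketch on hypothetical hold-out predictions (0 = no attrition, 1 = attrition):

```python
from sklearn.metrics import precision_score

# Hypothetical predictions on a held-out set, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]

# Precision for each class; the success criteria demand >= 0.80 for
# class 0 and >= 0.60 for class 1 on both train and test sets.
prec_no_attrition = precision_score(y_true, y_pred, pos_label=0)
prec_attrition = precision_score(y_true, y_pred, pos_label=1)
print(round(prec_no_attrition, 2), round(prec_attrition, 2))  # → 0.8 0.67
```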
- Introduction to the project and motivation
- Project dataset description
- Display the first ten rows of the data
- State business requirements
- We report on whether the 3 hypotheses we posed earlier are correct
- Checkbox to display the corresponding plot for each hypothesis
- State business requirement 1
- Correlation analysis
- Checkbox: display the most correlated variables to attrition
- Checkbox: display the predictive power score (PPS) heatmap
- Strongest correlation features to attrition (numerical and categorical)
- Selectbox: to select individual plots showing the attrition levels for each correlated variable
- Conclusions
- State business requirement 2
- Set of widgets inputs
- "Run predictive analysis" button that serves the prospect data to our ML pipelines and predicts if an employee will leave or not.
- Model success metrics
- Present ML pipeline steps
- Feature importance (as a list and a barplot figure), this is related to business requirement 1
- Pipeline performance, classification report and confusion matrix
- Considerations and conclusions after the pipeline is trained
- Project outcomes
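The performance page's classification report and confusion matrix can be produced with scikit-learn. A minimal sketch on hypothetical hold-out predictions (the dashboard renders the same two artefacts for the trained pipeline):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical predictions, for illustration only: 0 = no attrition,
# 1 = attrition.
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # → [[3 1]
           #    [0 2]]
print(classification_report(y_true, y_pred,
                            target_names=["No attrition", "Attrition"]))
```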
The technologies used throughout the development are listed below:
- Pandas - Data analysis and manipulation tool
- Numpy - The fundamental package for scientific computing with Python
- YData Profiling - For data profiling and exploratory data analysis
- Matplotlib - Comprehensive library for creating static, animated and interactive visualisations
- Seaborn - Data visualisation library for drawing attractive and informative statistical graphics
- Pingouin - Statistical package for simple yet exhaustive stats functions
- Feature-engine - Transformers to engineer and select features for machine learning models
- ppscore - Data-type-agnostic score that can detect linear or non-linear relationships between two columns
- scikit-learn - Machine learning library for training the ML model
- XGBoost - Optimised distributed gradient boosting library
- Imbalanced-learn - Tool for dealing with classification problems with imbalanced target
- Joblib - Tool for dumping pipeline to pickle files
- Kaggle - Kaggle API functionality
- Streamlit - Build the web app.
- Git - For version control
- GitHub - Code repository
- Heroku - For application deployment
- Gitpod - Cloud IDE used for development
- Jupyter Notebook - Interactive Python
- CI Python Linter - Style guide for python
- There are no unfixed bugs, except for Jupyter Notebook sometimes not plotting when the Run All button is pressed.
- Sometimes StreamlitAPIException: set_page_config() can only be called once per app page is raised; after a refresh of the webpage the Streamlit app works fine again. The source of the issue is unknown.
To log into the Heroku toolbelt CLI:
- Install the client in the terminal
curl https://cli-assets.heroku.com/install-ubuntu.sh | sh
- Log in to your Heroku account and go to Account Settings in the menu under your avatar.
- Scroll down to the API Key and click Reveal
- Copy the key
- In the terminal, run
heroku login -i
- Enter your email and paste in your API key when asked
- Set the stack to heroku-20
heroku stack:set heroku-20 --app attrition-predictor
- In this repo, set the runtime.txt Python version to python-3.8.19
The App live link is: https://attrition-predictor-dbb87cf1fa29.herokuapp.com/
- Log in to Heroku and create an App
- At the Deploy tab, select GitHub as the deployment method.
- Select your repository name and click Search. Once it is found, click Connect.
- Select the branch you want to deploy, then click Deploy Branch.
- The deployment process should run smoothly if all deployment files are fully functional. Click the Open App button at the top of the page to access your App.
- If the slug size is too large then add large files not required for the app to the .slugignore file.
setup.sh
should contain the following
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
Procfile
should contain
web: sh setup.sh && streamlit run app.py
- Helper function and custom class snippets used in this project were provided by Code Institute. These are mainly adapted from the predictive analytics module.
- The idea originated from a search on most used datasets on Kaggle. The content is already explained in the dataset content section.
- Thanks to my mentor, Mo Shami, for his support and guidance on the execution of the project. Thanks to Sean Tilson, my colleague at Code Institute, for the discussion on model performance.