IBM HR Analytics Employee Attrition & Performance - GitHub
Visit the live project here - https://attrition-predictor-dbb87cf1fa29.herokuapp.com/
The attrition predictor predicts whether an employee will remain in the workforce based on multiple factors such as demographics and work culture. Attrition in this context covers both voluntary and involuntary departures from an organisation for unpredictable or uncontrollable reasons. Managing and understanding attrition is pivotal for organisations that want a stable and engaged workforce. With a high attrition rate, a company is likely to shrink in size. Employee attrition leads to significant costs for a business, including business disruption and the hiring and training of new staff. There is therefore great business interest in understanding the drivers of attrition and minimising it.
The main objective is to predict whether an employee is about to leave the company. This would allow human resources to intervene and, where possible, prevent it by changing the employee's conditions.
The statistical study and data analysis were carried out to understand how attrition (the target) is affected by the other variables (the features). The problem was treated as a classification task.
- How to use this repo
- Dataset Content
- Business Requirements
- Hypothesis
- Mapping Business Requirements to Data Visualisation and ML Tasks
- ML Business Case
- Dashboard Design
- Technologies Used
- Unfixed Bugs
- Deployment
- Credits
- Acknowledgements
- Use this template to create your GitHub project repo
- Log into your cloud IDE with your GitHub account.
- On your Dashboard, click on the New Workspace button
- Paste in the URL you copied from GitHub earlier
- Click Create
- Wait for the workspace to open. This can take a few minutes.
- Open a new terminal and run
pip3 install -r requirements.txt
- Open the jupyter_notebooks directory, and click on the notebook you want to open.
- Click the kernel button and choose Python Environments.
Note that the kernel says Python 3.8.18 as it inherits from the workspace, which was set up with Python 3.8.18 by our template. To confirm this, you can run ! python --version in a notebook code cell.
- Important data disclaimer: This dataset was generated by IBM scientists and is made up of fictional data. The ML algorithm/predictor was built solely for learning purposes and shall not be used for drawing any real conclusions. As discussed here, the dataset is a snapshot, i.e. it is missing time-series data, so the prediction reflects current conditions rather than future events.
The dataset can be found on Kaggle and consists of 1470 rows and 35 columns, i.e. a total of 51450 data points. 9 columns are categorical of Object (or string) type, while the rest (26 columns) are numerical of integer type. The following summary was obtained from ProfileReport, imported from the ydata_profiling library.
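The categorical/numerical split reported by ProfileReport can also be verified with plain pandas. A minimal sketch on a hypothetical miniature stand-in for the Kaggle CSV (the real file has 1470 rows and 35 columns):

```python
import pandas as pd

# Toy frame standing in for the Kaggle dataset; column names match it,
# but the data here is made up for illustration.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "R&D", "R&D"],
})

# Count columns per dtype, mirroring the 9-object / 26-integer split
# that ProfileReport reports on the full dataset.
categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.tolist()
print(len(categorical), len(numerical))  # → 2 2 on this toy frame
```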
In 7 of the 26 numerical columns, the integers encode an ordinal category, so they are discrete. They are explained as follows:
Attribute | Information | Units |
---|---|---|
Education | Level of education of the employee | 1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor |
Environment Satisfaction | Level of employee satisfaction in the workplace environment | 1: Low, 2: Medium, 3: High, 4: Very High |
Job Involvement | How engaged the employee is in the workplace | 1: Low, 2: Medium, 3: High, 4: Very High |
Job Satisfaction | Level of employee satisfaction from the job | 1: Low, 2: Medium, 3: High, 4: Very High |
Performance Rating | Rating of employee performance | 1: Low, 2: Good, 3: Excellent, 4: Outstanding |
Relationship Satisfaction | Level of employee satisfaction with workplace relationships | 1: Low, 2: Medium, 3: High, 4: Very High |
Work-Life Balance | Level of employee work-life balance | 1: Bad, 2: Good, 3: Better, 4: Best |
For the remaining 19 numerical columns, the data summary is as follows:
Attribute | Information | Range |
---|---|---|
Age | Employee age | 18 - 60 |
Daily Rate* | Employee daily rate | 102 - 1499 |
Distance from Home | Distance from home to workplace | 1 - 29 |
Employee Count* | Constant column (always 1 per employee record) | 1 |
Employee Number* | Employee ID | 1 - 2068 |
Hourly Rate* | Employee hourly rate | 30 - 100 |
Job level | Employee job level (hierarchical) | 1 - 5 |
Monthly Income | Employee monthly income | 1009 - 19999 |
Monthly Rate* | Monthly rate | 2094 - 26999 |
Num Companies Worked | Number of companies worked at | 0 - 9 |
Percent Salary Hike | Percentage salary increase between two consecutive years | 11 - 25 |
Standard Hours* | Standard working hours per week | 80 |
Stock Options Level | How many stock options the employee holds in the company | 0 - 3 |
Total Working Years | Total years worked | 0 - 40 |
Training Times Last Year | Trainings the employee had | 0 - 6 |
Years At Company | Number of years employee stayed at the company | 0 - 40 |
Years In Current Role | Number of years employee stayed in current role | 0 - 18 |
Years Since Last Promotion | Years since last promotion | 0 - 15 |
Years With Curr Manager | Years spent with current manager | 0 - 17 |
The 9 categorical columns are summarised as follows:
Attribute | Information | Units |
---|---|---|
Attrition | Employee left? (Target) | ['Yes', 'No'] |
Business Travel | Rate of employee business travels | ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'] |
Department | Employee department | ['Sales', 'Research & Development', 'Human Resources'] |
Education Field | Employee education field | ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources'] |
Gender | Employee gender | ['Male', 'Female'] |
Job Role | Employee role | ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources'] |
Marital Status | Employee marital status | ['Single', 'Married', 'Divorced'] |
Over18* | Is Employee over 18? | ['Y'] |
OverTime | Employee worked overtime? | ['Yes', 'No'] |
The columns marked with an asterisk (*) were dropped, as they either contain only one value (Employee Count, Over 18, Standard Hours) or are considered not to affect the target (Employee Number). Additionally, the hourly, daily and monthly rates were dropped as they are ambiguous (also mentioned in this discussion). The focus was on monthly income instead.
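The dropping step can be sketched with pandas. A minimal example on a hypothetical toy frame (column names match the dataset; the values are made up):

```python
import pandas as pd

# Toy frame standing in for the Kaggle dataset.
df = pd.DataFrame({
    "Age": [41, 49],
    "EmployeeCount": [1, 1],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
    "EmployeeNumber": [1, 2],
    "HourlyRate": [94, 61],
    "DailyRate": [1102, 279],
    "MonthlyRate": [19479, 24907],
    "MonthlyIncome": [5993, 5130],
})

# Single-value columns carry no signal, EmployeeNumber is just an ID,
# and the three rate columns are ambiguous.
to_drop = ["EmployeeCount", "Over18", "StandardHours", "EmployeeNumber",
           "HourlyRate", "DailyRate", "MonthlyRate"]
df = df.drop(columns=to_drop)
print(df.columns.tolist())  # → ['Age', 'MonthlyIncome']
```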
The client (a company's HR department) asked a data practitioner for an ML algorithm to predict attrition based on the provided dataset. The aim is to minimise attrition rates.
- Business Requirement 1 - The client is interested in understanding the main factors leading to attrition
- Business Requirement 2 - The client is interested in predicting whether a certain employee will decide to leave the company
- We suspect that monthly income plays a big role in attrition
- Correlation analysis and plots
- We suspect that men tend to leave the workforce more often than women
- Correlation analysis and plots
- We suspect that only a few features affect attrition
- Feature selection in the ML Pipeline
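The third hypothesis can be probed with a model-based feature selector. A minimal sketch using scikit-learn's SelectFromModel on synthetic data (the project's actual pipeline uses Feature-engine transformers; this is only an illustration of the idea):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first of five features actually drives the
# target, mirroring the hypothesis that few features affect attrition.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Keep features whose importance is above the mean importance.
selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
print(selector.get_support())  # typically True only for the informative column
```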
- Business Requirement 1: Data Visualization and Correlation study
- We will inspect the dataset and look for patterns
- We will plot the main features against attrition to visualize insights
- We will do a feature importance analysis
- Business Requirement 2: Classification analysis
- We want to predict if an employee will leave or not. We want to build a binary classifier.
- We want to build a pipeline for the ML predictor algorithm
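Such a binary-classification pipeline can be sketched with scikit-learn. This is a hypothetical simplification: the feature lists are illustrative, and the real pipeline uses Feature-engine transformers and an XGBoost estimator rather than the components shown here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature subset for illustration only.
numeric = ["Age", "MonthlyIncome"]
categorical = ["OverTime", "Department"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Toy training data: 1 = attrition, 0 = no attrition.
X = pd.DataFrame({
    "Age": [25, 45, 30, 50],
    "MonthlyIncome": [2000, 9000, 3000, 12000],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Department": ["Sales", "R&D", "Sales", "R&D"],
})
y = [1, 0, 1, 0]
pipeline.fit(X, y)
```

Once fitted, pipeline.predict(new_data) returns 0 or 1 for each prospect employee.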
- We want an ML model to predict whether an employee will leave the company. The target variable is categorical and contains two classes, so we consider a supervised, 2-class, single-label classification model whose output is 0 (no attrition) or 1 (attrition).
- The ideal outcome is to provide the HR team with a prediction of whether an employee is about to leave the company.
- The model success metrics are:
- At least 80% precision for no attrition, on the train and test sets, because we want to be confident that an employee predicted to stay does not intend to leave the company.
- At least 60% precision for attrition, on the train and test sets, because we want to detect attrition as early as possible; we accept lower precision here in order to catch more potential leavers and take action.
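These per-class thresholds can be checked with scikit-learn's precision_score. A minimal sketch on hypothetical hold-out predictions (0 = no attrition, 1 = attrition):

```python
from sklearn.metrics import precision_score

# Hypothetical predictions on a held-out set, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]

# Precision for each class; the success criteria demand >= 0.80 for
# class 0 and >= 0.60 for class 1 on both train and test sets.
prec_no_attrition = precision_score(y_true, y_pred, pos_label=0)
prec_attrition = precision_score(y_true, y_pred, pos_label=1)
print(round(prec_no_attrition, 2), round(prec_attrition, 2))  # → 0.8 0.67
```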
- Introduction to the project and motivation
- Project dataset description
- Display the first ten rows of the data
- State business requirements
- We report on whether the 3 hypotheses we posed earlier are correct
- Checkbox to display the corresponding plot for each hypothesis
- State business requirement 1
- Correlation analysis
- Checkbox: display the most correlated variables to attrition
- Checkbox: display the predictive power score (PPS) heatmap
- Strongest correlation features to attrition (numerical and categorical)
- Selectbox: to select individual plots showing the attrition levels for each correlated variable
- Conclusions
- State business requirement 2
- Set of widgets inputs
- "Run predictive analysis" button that serves the prospect data to our ML pipelines and predicts if an employee will leave or not.
- Model success metrics
- Present ML pipeline steps
- Feature importance (as a list and a barplot figure), this is related to business requirement 1
- Pipeline performance, classification report and confusion matrix
- Considerations and conclusions after the pipeline is trained
- Project outcomes
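The performance page's classification report and confusion matrix can be produced with scikit-learn. A minimal sketch on hypothetical hold-out predictions (the dashboard renders the same two artefacts for the trained pipeline):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical predictions, for illustration only: 0 = no attrition,
# 1 = attrition.
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # → [[3 1]
           #    [0 2]]
print(classification_report(y_true, y_pred,
                            target_names=["No attrition", "Attrition"]))
```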
The technologies used throughout the development are listed below:
- Pandas - Data analysis and manipulation tool
- Numpy - The fundamental package for scientific computing with Python
- YData Profiling - For data profiling and exploratory data analysis
- Matplotlib - Comprehensive library for creating static, animated and interactive visualisations
- Seaborn - Data visualisation library for drawing attractive and informative statistical graphics
- Pingouin - Statistical package for simple yet exhaustive stats functions
- Feature-engine - Transformers to engineer and select features for machine learning models
- ppscore - Data-type-agnostic score that can detect linear or non-linear relationships between two columns
- scikit-learn - Machine learning library for training the ML model
- XGBoost - Optimised distributed gradient boosting library
- Imbalanced-learn - Tool for dealing with classification problems with imbalanced target
- Joblib - Tool for dumping pipeline to pickle files
- Kaggle - Kaggle API functionality
- Streamlit - Build the web app.
- Git - For version control
- GitHub - Code repository
- Heroku - For application deployment
- Gitpod - Cloud IDE used for development
- Jupyter Notebook - Interactive Python
- CI Python Linter - Style guide for python
- There are no unfixed bugs, except for Jupyter Notebook sometimes not plotting when the Run All button is pressed.
- Sometimes StreamlitAPIException: set_page_config() can only be called once per app page is raised; after a refresh of the webpage the Streamlit app works fine again. The source of the issue is unknown.
To log into the Heroku toolbelt CLI:
- Install the client in the terminal
curl https://cli-assets.heroku.com/install-ubuntu.sh | sh
- Log in to your Heroku account and go to Account Settings in the menu under your avatar.
- Scroll down to the API Key and click Reveal
- Copy the key
- In the terminal, run
heroku login -i
- Enter your email and paste in your API key when asked
- Set the stack to heroku-20
heroku stack:set heroku-20 --app attrition-predictor
- In this repo, set the runtime.txt Python version to python-3.8.19
The App live link is: https://attrition-predictor-dbb87cf1fa29.herokuapp.com/
- Log in to Heroku and create an App
- At the Deploy tab, select GitHub as the deployment method.
- Select your repository name and click Search. Once it is found, click Connect.
- Select the branch you want to deploy, then click Deploy Branch.
- The deployment process should run smoothly if all deployment files are fully functional. Click the Open App button at the top of the page to access your App.
- If the slug size is too large then add large files not required for the app to the .slugignore file.
setup.sh
should contain the following
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
Procfile
should contain
web: sh setup.sh && streamlit run app.py
- Helper function and custom class snippets used in this project were provided by Code Institute. These are mainly adapted from the predictive analytics module.
- The idea originated from a search on most used datasets on Kaggle. The content is already explained in the dataset content section.
- Thanks to my mentor, Mo Shami, for his support and guidance on the execution of the project. Thanks to Sean Tilson, my colleague at Code Institute, for the discussion on model performance.