This repository contains a machine learning project using a multiple regression model to predict life ecpectancy. These YouTube videos are a nice place to start for understanding the method:
A report describing the theory behind the project, the data set, the project process, and the results can be found in the blabla.pdf file.
Some of the project results are presented in an interactive website here, where you can predict the life excpectancy based on parameters that you decide the value of. This website also breifly describes the theory behind the methods used.
The data set used in the project is from the World Health Organization (WHO). It consists of the life expectancy for 193 countries over a period of 15 years, and different health factors, affecting the life expectancy. The life excpectancy is the response while the health factors are the parameters. The data set can be found here: WHO data set.
Relevant data for the project can be found in the data
folder:
-
life-expectancy.csv
contains the relevant data set for the project. -
life-expectancy.txt
contains information about all the variables in the data set.
You will find the code for the model and the experiments used in the project in the folder tests
. Here there are six files:
-
multiple_regression.py
: The multiple regression model is made in the functionmultiple_regression(df_regr, parameter_list)
. This function takes a data frame and a list of parameter names as arguments. First, it splits the data in training and testing sets, then performs a multiple regression as well as plotting the predicted life expectancy and the responses. This function is imported to and used in some of the other files in the project. -
backward_elimination.py
: Uses the backward elimination method to identify the most important parameters so that the model is reliable. This is done in four steps, which are explained as comments in the file. -
forward_selection.py
: Experiments with the forward selection method. -
correlation_matrix.py
: Makes a correlation matrix showing the correlation coeffictients of all the parameter pairs and visualizes it in a heatmap. -
error_plots.py
: Evaluates the model by calculationg the errors, sum of errors, mean average error and plotting these. -
dummy_variables.py
: Experiments with dummy variables on the Country parameter.
The results of the project is, as mentioned, visualized on an interactive website. The code for this website is in the app.py
file in the main folder.
In this project, the following technologies are used:
- Python – The programming language used in the project
- Dash – A framework for building data visualization web applications
- Pandas – A data analysis Python library
- scikit-learn and statsmodels – Tools such as classes and functions for predictive data analysis (useful for the multiple regression analysis)
To contribute to the project, first make a project directory locally on your computer. Inside this, create a virtual environment for the project. In this project directory, you should also clone this repository to a new folder. Some of the packages that are necessary to install for the project to run locally:
pip install pandas
pip install dash
pip install statsmodels
pip install -U scikit-learn
The rest of the packages needed for the project can be found in the requirements.txt
file.
- The website is made in the
app.py
file. - For experimentation and testing, create
.py
files in thetests
folder.
To import the enitre data set to a testing file in the tests
folder, i.e. use the following:
import pandas as pd
data_url = "https://raw.githubusercontent.com/sofieaasheim/supervised-learning-project/main/data/life-expectancy.csv"
df = pd.read_csv(data_url, sep=",")
When performing tests, make sure that you only push to the tests branch (or make a new branch). The main branch should only be used for pushing finished stuff.
To run the application locally, type python app.py
in the terminal while in the supervised-learning-project
folder and go to http://127.0.0.1:8050/ in your browser.
To run the test files, type python name_of_file.py
in the terminal while in the tests
folder.
The code for the website is deployed directly from the main branch through Heroku. The web page can be found at https://tdt4173group9.herokuapp.com/.