This is a machine learning project on the breast cancer dataset that predicts whether a person is likely to be diagnosed with breast cancer. I tested three basic models for predicting the diagnosis: logistic regression, decision tree and random forest.
The project is written entirely in Python in a Jupyter notebook. I used the following Python libraries:
- pandas: to read and visualize the data
- seaborn: to plot the dataset
- matplotlib.pyplot: to plot the graphs
- sklearn.preprocessing.StandardScaler: for feature scaling
- sklearn.linear_model.LogisticRegression: for the logistic regression model
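As a rough illustration of the feature scaling step, here is a minimal sketch. It assumes the copy of the breast cancer dataset bundled with scikit-learn (`load_breast_cancer`) as a stand-in for the project's data file, and the 80/20 split and `random_state` are arbitrary choices, not the notebook's actual settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the project's data.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 20% of the rows for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the testing set leaks into the scaling.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set alone keeps the testing score an honest estimate of performance on unseen data.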
Steps to implement the project
- Open Google Colab or Jupyter Notebook.
- Either download the source code or copy the code into a new .py or .ipynb file.
- Run the code.
I tested three models. The table below shows each model's score on the training and testing sets.
Model | Training score | Testing score |
---|---|---|
Logistic regression | 0.989 | 0.956 |
Decision Tree | 1.0 | 0.938 |
Random Forest | 0.997 | 0.973 |
The decision tree scores a perfect 1.0 on the training set but only 0.938 on the testing set, which suggests it is overfitted. The random forest gives the best testing score (0.973), so it should be used for future predictions.
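A comparison like the one in the table can be sketched as follows. This is not the notebook's exact code: it assumes scikit-learn's bundled `load_breast_cancer` data, default hyperparameters, and an arbitrary 80/20 split, so the scores it prints will not match the table exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset and split settings, not the project's own.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Logistic regression benefits from scaled features; tree models do not need it.
models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

A large gap between the training and testing scores for a model is the sign of overfitting to look for.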