University of Helsinki
Introduction to Machine Learning
Fall 2020 Term Project
Bernardo Williams (GitHub: williwilliams3)
Julia Sanders (GitHub: julia-sand)
Mikko Saukkoriipi (GitHub: Saukkoriipi)
The full project report, with all the details, is available at: Project_report_and_presentation/Project_report.pdf
The goal was to create the best possible machine learning model for predicting potential new particle formation (NPF) events from the 100 daily features.
We fitted several machine learning classification models to the dataset npf_train.csv, which we divided into training, validation and test sets, with the purpose of predicting both a binary and a multi-class label. The objective was to extend the models and their predictions to unseen data, and also to estimate the accuracy the models would have on that unseen data.
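As a minimal sketch of the two prediction targets and of how accuracy on held-out rows can be measured, the snippet below uses only the Python standard library. The label names (`nonevent`, `Ia`, `Ib`, `II`) and the helper names `to_binary`, `accuracy` and `holdout_split` are assumptions made for illustration; the actual labels in npf_train.csv and the project's own code may differ.

```python
import random

# Hypothetical label name; the real labels in npf_train.csv may differ.
NONEVENT = "nonevent"


def to_binary(label):
    """Collapse a multi-class label into 'event' or 'nonevent'."""
    return NONEVENT if label == NONEVENT else "event"


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


# Toy labels standing in for the multi-class target.
multiclass = ["nonevent", "Ia", "II", "nonevent", "Ib"]
binary = [to_binary(label) for label in multiclass]

# Rows in test_idx are never used for fitting, so accuracy measured on them
# approximates the accuracy expected on unseen data.
train_idx, test_idx = holdout_split(len(multiclass), test_fraction=0.4)
```

Keeping the test indices out of model fitting and model selection is what makes the resulting accuracy an estimate for unseen data rather than a measure of fit.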
To fit the models we used two dimensionality reduction techniques, PCA and best-K feature selection, and two normalization methods, min-max scaling and standardization. We fitted algorithmic, generative and discriminative methods, using either a validation set or cross-validation to measure accuracy for both the binary and the multi-class classifiers, and identified which ones achieved the highest accuracy on an unbiased test set. Lastly, we found that averaging the predictions of the best algorithmic, discriminative and generative methods gave higher accuracy, and accuracy that was more consistent across the training, validation and test sets.
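The two normalization methods and the averaging step can be illustrated with a short, standard-library-only sketch. It assumes that "average prediction" means averaging each model's predicted probability of an event and thresholding the result at 0.5; the report may define it differently (for example as a majority vote). The function names and the probabilities below are hypothetical and are not results from the project.

```python
def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def average_probabilities(model_probs):
    """Average P(event) across models, row by row.

    model_probs holds one list per model, each with one probability per row.
    """
    n_models = len(model_probs)
    return [sum(row) / n_models for row in zip(*model_probs)]


def predict_binary(probs, threshold=0.5):
    """Turn averaged probabilities into event / nonevent labels."""
    return ["event" if p >= threshold else "nonevent" for p in probs]


# Made-up probabilities from three hypothetical models for three days.
algorithmic = [0.9, 0.4, 0.2]
discriminative = [0.8, 0.6, 0.1]
generative = [0.7, 0.35, 0.6]

averaged = average_probabilities([algorithmic, discriminative, generative])
labels = predict_binary(averaged)
```

Averaging tends to dampen the errors of any single model, which is one plausible reason for the more consistent accuracy reported for the ensemble. In practice, the scaling parameters (minimum, maximum, mean, standard deviation) should be computed on the training rows only and then reused for validation and test rows.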
Accuracy | DT Binary | RF Binary | XGB Binary | KNN Binary | Log Binary | NB PCA | SVM | Ensemble |
---|---|---|---|---|---|---|---|---|
Training | 88% | 100% | 100% | 85% | 86% | 84% | 98% | 96% |
Validation | 84% | 87% | 90% | 78% | 85% | 87% | 90% | 96% |
Test | 88% | 88% | 87% | 80% | 85% | 93% | 83% | 92% |
Accuracy | DT Multiclass | RF Multiclass | XGB Multiclass | KNN Multiclass | NB PCA | SVM | Ensemble |
---|---|---|---|---|---|---|---|
Training | 66% | 100% | 100% | 66% | 69% | 83% | 94% |
Validation | 64% | 66% | 70% | 57.7% | 62% | 69% | 98% |
Test | 67% | 72% | 70% | 57.7% | 65% | 68% | 70% |