Binary Classification: Logistic Regression SMOTE, Isolation Forest, LOF, xgboost, lightgbm, catboost, pyspark, keras

bhishanpdl/Project_Fraud_Detection

Project Description

In this project I used the Kaggle Credit Card Fraud dataset to classify each transaction as fraudulent or legitimate.

Assumptions:

  • The data covers just two days, but we assume it is representative of the whole population of credit card transactions.
  • We assume the 28 variables V1-V28 were obtained from a correctly applied PCA and are properly scaled.

Metric Used

  • Here 1 means fraud and 0 means not fraud.
  • For this imbalanced dataset, a false negative (fraud classified as not fraud) is more costly than a false positive (not fraud classified as fraud), so we use Recall as the evaluation metric. (Recall = TP / (TP + FN)).
  • For an imbalanced dataset, ROC AUC gives an overly optimistic picture. Instead, we should use precision_recall_curve and, after looking at the curve, choose the threshold that gives the precision and recall we want.
  • Precision and recall do not involve TN, so we should rely on them only when specificity (TNR = TN / (TN + FP)) is not important.
  • For an imbalanced dataset, we can also use the F_beta metric. If precision and recall are equally important, we can use the F1-score. If we consider recall beta times as important as precision, we can use F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. (Mnemonic: in the denominator, beta^2 multiplies P, so a larger beta makes the score track Recall more closely.) Common values are 2 and 0.5. If beta is 2, recall is considered twice as important as precision.
  • F_beta depends only on Precision and Recall. It does not depend on TN (true negatives), so for imbalanced classification a better metric could be MCC (Matthews Correlation Coefficient).
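
The formulas in this list can be checked with a few lines of plain Python. The sketch below is only an illustration, not the project's own evaluation code. The counts are derived from the keras row of the results tables in this README (13 missed frauds, 674 untrue frauds, recall 0.867347, which implies 98 frauds in the test set). The TN value is an assumption inferred from the reported accuracy, since the README does not state the test set size.

```python
import math


def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; unlike F_beta, it uses TN."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts derived from the keras row (FN = 13, FP = 674, recall = 85 / 98).
TP, FN, FP = 85, 13, 674
# Assumption: TN is inferred from the reported accuracy of 0.987939.
TN = 56190

p, r = precision_recall(TP, FP, FN)
f1 = f_beta(p, r, beta=1.0)      # precision and recall weighted equally
f2 = f_beta(p, r, beta=2.0)      # recall weighted more heavily
f_half = f_beta(p, r, beta=0.5)  # precision weighted more heavily
m = mcc(TP, TN, FP, FN)

print(f"precision={p:.6f} recall={r:.6f}")
print(f"F0.5={f_half:.6f} F1={f1:.6f} F2={f2:.6f} MCC={m:.6f}")
```

Because recall is much higher than precision for this model, F2 comes out above F1 and F0.5 below it, which shows how the choice of beta changes the ranking of a high-recall, low-precision classifier.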

Resampling Techniques

  • Our dataset is imbalanced, so we can try two resampling strategies: undersampling and oversampling.
  • Undersampling: we have few frauds, so we randomly choose the same number of non-frauds.
  • Oversampling: the SMOTE method, using the external library imblearn.
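
The undersampling step described above can be sketched with the standard library alone. This is a minimal illustration on toy labels, not the code used in the project. The function name `undersample_indices` and the toy data are hypothetical, and the sketch assumes there are at least as many non-frauds as frauds. SMOTE is not shown here because it relies on imblearn.

```python
import random


def undersample_indices(labels, seed=0):
    """Keep every fraud (1) and an equally sized random sample of non-frauds (0)."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    # Sample without replacement so that no non-fraud row is duplicated.
    keep_normal = rng.sample(normal_idx, k=len(fraud_idx))
    balanced = fraud_idx + keep_normal
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy labels: 5 frauds among 1000 transactions.
labels = [1] * 5 + [0] * 995
idx = undersample_indices(labels, seed=42)
balanced_labels = [labels[i] for i in idx]
print(len(idx), sum(balanced_labels))  # 10 5
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class ratio.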

Best Model So Far

| Model | Description | Accuracy | Precision | Recall | F1 | AUC | Untrue Frauds (FP) | Missed Frauds (FN) |
|---|---|---|---|---|---|---|---|---|
| keras | 1 layer, class_weight, early_stopping, scikit api | 0.987939 | 0.111989 | 0.867347 | 0.198366 | 0.927747 | 674 | 13 |
| cb_tuned | pycaret fold=5 | 0.9996 | 0.9667 | 0.7865 | 0.8642 | 0.9659 | | |
| catboost | seed=100,depth=6,iter=1k | 0.999631 | 1.000000 | 0.785714 | 0.880000 | 0.892857 | 0 | 21 |

Undersampling

Recall for all Classifiers with Grid Search for Undersampled Data


SMOTE Oversampling: Logistic Regression


Anomaly Detection Methods

| Model | Description | Accuracy | Precision | Recall | F1 (Weighted) |
|---|---|---|---|---|---|
| Isolation Forest | default | 0.997384 | 0.261682 | 0.285714 | 0.997442 |
| Local Outlier Factor | default | 0.996331 | 0.025641 | 0.030612 | 0.996493 |
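
The weighted F1 values in this table are close to 1 even though recall is below 0.3, because the weighted average is dominated by the majority class. The sketch below reproduces the Isolation Forest figure from confusion counts. The counts are derived from the reported precision (28 / 107) and recall (28 / 98), and the TN value is an assumption inferred from the reported accuracy, since the README does not list the confusion matrix.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * TP / (2 * TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts derived from the Isolation Forest row (fraud is the positive class).
TP, FN, FP = 28, 70, 79
# Assumption: TN is inferred from the reported accuracy of 0.997384.
TN = 56785

# F1 of the fraud class.
f1_fraud = f1_from_counts(TP, FP, FN)
# F1 of the non-fraud class: the roles of the counts are swapped.
f1_normal = f1_from_counts(TN, FN, FP)

# Weighted F1 averages the per-class scores by class support.
n_fraud = TP + FN
n_normal = TN + FP
weighted_f1 = (n_fraud * f1_fraud + n_normal * f1_normal) / (n_fraud + n_normal)

print(f"fraud F1={f1_fraud:.6f} non-fraud F1={f1_normal:.6f}")
print(f"weighted F1={weighted_f1:.6f}")
```

The fraud-class F1 is about 0.27, so the weighted figure should not be compared directly with the F1 columns of the other tables.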

Gradient Boosting Modelling

| Model | Description | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| lightgbm | grid search optuna | 0.999315 | 0.873418 | 0.704082 | 0.779661 | 0.851953 |
| lightgbm | default | 0.997367 | 0.275862 | 0.326531 | 0.299065 | 0.662527 |
| Xgboost | default, imbalanced | 0.999263 | 0.850000 | 0.693878 | 0.764045 | 0.846833 |
| Xgboost | default, undersampling | 0.999263 | 0.850000 | 0.693878 | 0.764045 | 0.846833 |
| Xgboost | n_estimators=150, imbalanced | 0.999263 | 0.850000 | 0.693878 | 0.764045 | 0.846833 |
| Xgboost | undersample, hpo1 | 0.999298 | 0.881579 | 0.683673 | 0.770115 | 0.841758 |
| Xgboost | imbalanced, hpo | 0.999245 | 0.898551 | 0.632653 | 0.742515 | 0.816265 |
| xgboost | grid search optuna | 0.999333 | 0.875000 | 0.714286 | 0.786517 | 0.857055 |
| catboost | seed=100,depth=6,iter=1k | 0.999631 | 1.000000 | 0.785714 | 0.880000 | 0.892857 |

Automatic Modelling: pycaret

| Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
| lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
| xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
| cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
| lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
| xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
| lda_tuned | n_iter=100,fold=10 | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |

Big Data Modelling: PySpark


Deep Learning Models

| Model | Description | Accuracy | Precision | Recall | F1 | AUC | Missed Frauds (FN) | Untrue Frauds (FP) |
|---|---|---|---|---|---|---|---|---|
| keras | 3 layers, 2 dropouts, class_weight | 0.983744 | 0.081818 | 0.826531 | 0.148897 | 0.905273 | 17 | 909 |
| keras | 1 layer, dropout, early_stopping | 0.984990 | 0.090811 | 0.857143 | 0.164223 | 0.921177 | 14 | 841 |
| keras | 1 layer, dropout, steps_per_epoch, oversampling | 0.982796 | 0.080000 | 0.857143 | 0.146341 | 0.920077 | 14 | 966 |
| keras | 1 layer, class_weight, early_stopping, scikit api | 0.987939 | 0.111989 | 0.867347 | 0.198366 | 0.927747 | 13 | 674 |
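
Two of the rows above mention class_weight, which in Keras is a dictionary passed to `model.fit` that scales each class's contribution to the loss. The README does not say which weights were used, so the sketch below only shows one common heuristic (inverse class frequency). The helper name `balanced_class_weights` is hypothetical, and the counts are those of the full Kaggle dataset; a training split would have different counts.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Full Kaggle dataset: 284,315 non-frauds (0) and 492 frauds (1).
# Assumption: illustrative only; the project's actual weights are not given.
counts = {0: 284315, 1: 492}
weights = balanced_class_weights(counts)
print(weights)

# The dictionary could then be passed to Keras, for example:
#   model.fit(X_train, y_train, class_weight=weights, ...)
```

With these weights each class contributes the same total weight to the loss, which pushes the network toward higher recall on the fraud class at the cost of more false positives, consistent with the high-recall, low-precision pattern in the table.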
