Code : Importing all the necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec
Code : Loading the Data
data = pd.read_csv("credit.csv")
Code : Understanding the Data
data.head()
Code : Describing the Data
# Work with a reproducible 10% random sample of the rows
data = data.sample(frac = 0.1, random_state = 48)
print(data.shape)
print(data.describe())
Code : Imbalance in the data
Next, we examine the class distribution of the data we are dealing with.
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
# Ratio of fraudulent to valid transactions
outlierFraction = len(fraud) / float(len(valid))
print(outlierFraction)
print('Fraud Cases: {}'.format(len(fraud)))
print('Valid Transactions: {}'.format(len(valid)))
Let's first apply our model without balancing the data. If we don't get good accuracy, we can then find a way to balance the dataset.
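If balancing does turn out to be needed, one simple option is random undersampling of the majority class. The sketch below is only an illustration under assumptions: it runs on a small synthetic DataFrame rather than credit.csv, and the helper name `undersample` is hypothetical (it is not a pandas or scikit-learn function).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000 rows, 2% labeled as fraud.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

def undersample(df, label_col="Class", random_state=0):
    """Hypothetical helper: keep all minority rows plus an equal-sized
    random sample of the majority rows."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label].sample(
        n=n_minority, random_state=random_state)
    # Shuffle so the two classes are interleaved.
    return pd.concat([minority, majority]).sample(
        frac=1, random_state=random_state)

balanced = undersample(demo)
print(balanced["Class"].value_counts())
```

Undersampling throws away most of the valid transactions, so it is usually worth comparing against the unbalanced baseline before adopting it.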
Code : Print the amount details for Fraudulent Transaction
print("Amount details of the fraudulent transaction")
fraud.Amount.describe()
Code : Print the amount details for Normal Transaction
print("Amount details of the valid transaction")
valid.Amount.describe()
Code : Plotting the Correlation Matrix
The correlation matrix gives us a graphical view of how the features correlate with each other and can help us identify the features that are most relevant for the prediction.
corrmat = data.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()
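One way to read the correlation matrix numerically, rather than by eye, is to rank the features by the absolute value of their correlation with the Class column. The sketch below assumes a small synthetic DataFrame; the column names `V_informative` and `V_noise` are made up for illustration and do not come from credit.csv.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that tracks the label, one that is pure noise.
rng = np.random.default_rng(1)
n = 500
label = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "V_informative": label + rng.normal(scale=0.3, size=n),
    "V_noise": rng.normal(size=n),
    "Class": label,
})

# Correlation of every feature with the label, strongest first.
corr_with_class = demo.corr()["Class"].drop("Class")
ranking = corr_with_class.abs().sort_values(ascending=False)
print(ranking)
```

Keep in mind that this kind of correlation only captures linear relationships, so a low value does not prove a feature is useless to a tree-based model.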
Code : Separating the X and the Y values
Dividing the data into the input features (X) and the output label (Y).
X = data.drop(['Class'], axis = 1)
Y = data["Class"]
print(X.shape)
print(Y.shape)
# Convert to NumPy arrays
xData = X.values
yData = Y.values
Training and Testing Data Bifurcation
We will divide the dataset into two groups: one for training the model and the other for testing the trained model's performance.
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing
xTrain, xTest, yTrain, yTest = train_test_split(
    xData, yData, test_size = 0.2, random_state = 42)
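With a rare positive class, a purely random split can leave very few fraud cases in the test set. An optional variation, not used in the code above, is to pass `stratify` to `train_test_split` so both parts keep the same class ratio. The sketch below assumes synthetic arrays (`X_demo`, `y_demo`) in place of the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 samples, 20 of them (2%) positive.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo preserves the 2% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("positives in train:", y_tr.sum())
print("positives in test: ", y_te.sum())
```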
Code : Building a Random Forest Model using scikit-learn
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
yPred = rfc.predict(xTest)
Code : Computing the evaluation metrics
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")
acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))
prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))
rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))
f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))
MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))
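To see how these metrics relate to each other, here is a small worked example that computes them by hand from the four confusion-matrix counts. The counts are invented for illustration; they are not results from this dataset or from the model above.

```python
import math

# Hypothetical counts: true negatives, false positives,
# false negatives, true positives (not results from credit.csv).
tn, fp, fn, tp = 9960, 5, 20, 15

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)      # of the fraud cases, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print("accuracy  = {:.4f}".format(accuracy))
print("precision = {:.4f}".format(precision))
print("recall    = {:.4f}".format(recall))
print("F1        = {:.4f}".format(f1))
print("MCC       = {:.4f}".format(mcc))
```

With these made-up counts the accuracy is 0.9975 even though only 15 of the 35 fraud cases are caught, which is why precision, recall, F1 and MCC are worth reporting alongside accuracy on imbalanced data.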
Code : Visualizing the Confusion Matrix
LABELS = ['Normal', 'Fraud']
conf_matrix = confusion_matrix(yTest, yPred)
plt.figure(figsize =(12, 12))
sns.heatmap(conf_matrix, xticklabels = LABELS,
            yticklabels = LABELS, annot = True, fmt = "d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
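The confusion matrix can also be unpacked into its four counts instead of being plotted. For binary labels 0 and 1, scikit-learn puts the true classes in rows and the predicted classes in columns, so `ravel()` yields the counts in the order tn, fp, fn, tp. The sketch below uses toy label lists (`y_true_demo`, `y_pred_demo`) that are assumptions for illustration, not model output.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = normal, 1 = fraud.
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print("TN={} FP={} FN={} TP={}".format(tn, fp, fn, tp))
```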