Merge pull request #626 from codewithpiyushh/main
Created Email Spam Classifier
abhisheks008 authored Jun 6, 2024
2 parents a5a6bff + 5f734de commit 0c3df45
Showing 14 changed files with 8,443 additions and 0 deletions.
5,573 changes: 5,573 additions & 0 deletions End to End Email Spam Classifier/Dataset/spam.csv

Large diffs are not rendered by default.

2,595 changes: 2,595 additions & 0 deletions End to End Email Spam Classifier/Model/Email_spam_Classifier2.ipynb

Large diffs are not rendered by default.

78 changes: 78 additions & 0 deletions End to End Email Spam Classifier/Model/README.md
@@ -0,0 +1,78 @@
## **End to End Email Spam Classifier**

### 🎯 **Goal**

The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### 🧵 **Dataset**

The dataset used in this project is the SMS Spam Collection dataset from Kaggle:
[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]

### 🧮 **What I have done!**

In this email spam classification project, I evaluated several models and found that the Support Vector Machine (SVM) achieved the highest accuracy. SVMs are classifiers that separate data into distinct classes by finding an optimal hyperplane. They can handle both linear and non-linear data through kernel functions, and by maximizing the margin between classes they reduce misclassifications. Consequently, these results highlight SVMs as a robust choice for spam detection, contributing to better email filtering and an enhanced user experience. 📧🔍🚀

### 🧾 **Description**

Email spam classification refers to the process of automatically categorizing incoming emails as either spam (unsolicited, irrelevant, or potentially harmful) or legitimate (also known as “ham”). The goal is to enhance user experience by organizing and prioritizing emails, reducing annoyance, and protecting users from malicious content. Various machine learning models, including Support Vector Machines (SVMs), are commonly used for accurate spam detection. 📧🔍🚀

### 🚀 **Models Implemented**

I trained five models: Logistic Regression, K-Neighbors Classifier, Naive Bayes, Decision Tree Classifier, and Support Vector Machine. Of these, the Support Vector Machine turned out to have the highest accuracy.
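
The comparison described above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration, not the notebook's actual code: the toy messages, the parameter grids, and the choice of `MultinomialNB` for Naive Bayes are assumptions, and Logistic Regression is used as listed in the performance section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy messages made up for illustration; 1 = spam, 0 = ham.
messages = [
    "win a free prize now", "claim your free cash reward",
    "urgent winner call now", "free entry in a weekly prize draw",
    "are we meeting for lunch today", "please send me the report",
    "see you at home tonight", "thanks for the help yesterday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# For brevity the vectorizer is fitted on all messages before cross-validation.
X = TfidfVectorizer().fit_transform(messages)

# Hypothetical search grids; the real grids live in the notebook.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [1, 10], "penalty": ["l2"]}),
    "K-Neighbors Classifier": (KNeighborsClassifier(), {"n_neighbors": [1, 3]}),
    "Naive Bayes": (MultinomialNB(), {"alpha": [0.5, 1.0]}),
    "Decision Tree Classifier": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 7]}),
    "Support Vector Machine": (SVC(), {"kernel": ["linear"], "gamma": [0.1]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 2-fold stratified cross-validation, scored by accuracy
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=2)
    search.fit(X, labels)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(f"{name}: best params {params}, mean CV accuracy {score:.3f}")
```

On a dataset this small the scores are not meaningful; the point is only the shape of the search loop.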

### 📚 **Libraries Needed**

- scikit-learn: Machine learning library for model training and evaluation.
- pandas: Data manipulation library for preprocessing and data handling.
- numpy: Numerical computing library for mathematical operations.
- matplotlib, seaborn, plotly: Visualization libraries for data exploration and analysis.
- streamlit: Web app library for deploying the model.

### 📊 **Exploratory Data Analysis Results**

![graphs](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/08ba4a85-6ed3-409b-bc43-52362d2ffd17)


### 📈 **Performance of the Models based on the Accuracy Scores**

- *Logistic Regression*:
  - Best Parameters: {'C': 10, 'penalty': 'l2'}
  - Accuracy: 0.973
  - Precision: 0.973

- *K-Neighbors Classifier*:
  - Best Parameters: {'n_neighbors': 3}
  - Accuracy: 0.999
  - Precision: 0.906

- *Naive Bayes*:
  - Best Parameters: {'min_samples_split': 5, 'max_depth': 20}
  - Accuracy: 0.977
  - Precision: 0.977

- *Decision Tree Classifier*:
  - Best Parameters: {'max_depth': 7}
  - Accuracy: 0.937
  - Precision: 0.933

- *Support Vector Machine*:
  - Best Parameters: {'gamma': 0.1, 'kernel': 'linear'}
  - Accuracy: 0.978
  - Precision: 0.978
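
For reference, the two metrics reported above can be computed by hand as follows. This is a small sketch with made-up labels (1 = spam, 0 = ham), not output from the notebook; the function name is hypothetical.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels.

    accuracy  = correct predictions / all predictions
    precision = true positives / (true positives + false positives)
    """
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    accuracy = correct / len(pairs)
    # If nothing was predicted as spam, report precision as 0.0
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Made-up example: one ham message flagged as spam, one spam message missed.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

Precision matters for a spam filter because a false positive means a legitimate message was flagged as spam.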

![model-accuracy](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/a283987a-523d-4ee2-8484-1ff31a41eb83)

![wordcloud-1](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/2b6fbade-72ea-442a-afe7-000b93975e92)

![wordcloud-2](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/0972dc14-6d1c-4d42-a0ef-d3ae9dde064f)


### 📢 **Conclusion**

In conclusion, email spam classification significantly improves user experience by organizing emails, detecting and filtering out spam, enhancing security, and ensuring legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### ✒️ **Your Signature**

Piyush
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
Binary file added End to End Email Spam Classifier/Model/model.pkl
Binary file not shown.
Binary file not shown.
94 changes: 94 additions & 0 deletions End to End Email Spam Classifier/README.md
@@ -0,0 +1,94 @@
# 🧠 End to End Email Spam Classifier

![interface](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/db86a574-6e9c-47ee-bdbf-5966b8c81773)

## 📝 Abstract

Email spam classification involves automatically identifying and sorting emails into two categories: spam (unwanted messages) and ham (legitimate messages). Techniques such as analyzing email content and user behavior, together with machine learning models, help achieve this. The goal is to keep spam out of your inbox and ensure you receive important emails.


## 🔍 Methodology

1. **Importing Libraries:**
   - Libraries such as NumPy, Pandas, scikit-learn, and others are imported for data manipulation, visualization, and machine learning model building.

2. **Loading the Dataset:**
   - The dataset, which contains rows of messages labeled as spam or not spam, is loaded.

3. **Data Preprocessing:**
   - The data is prepared for analysis: handle missing values, encode categorical data, scale features, perform feature engineering, split into train-test sets, and normalize data, so that the data is in a suitable format for machine learning algorithms.

4. **Training the Models:**
   - Each candidate model, including the Support Vector Machine, is set up with scikit-learn.
   - The models are trained on the training dataset and then evaluated.

5. **Model Performance Analysis:**
   - The accuracy and precision of each model are plotted to visualize the models' performance.

6. **Model prediction:**
   - The model is then given a test dataset to check its accuracy and precision.

7. **Deploy:**
   - The model is deployed using the Streamlit library.
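
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's code: the toy messages are made up, and `TfidfVectorizer` with built-in English stop words stands in for the project's own preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy messages made up for illustration; 1 = spam, 0 = ham.
messages = [
    "win a free prize now", "claim your free cash reward today",
    "urgent winner call this number", "free entry in a weekly prize draw",
    "cheap loans approved instantly", "exclusive offer click to claim",
    "are we meeting for lunch today", "please send me the project report",
    "see you at home tonight", "thanks for the help yesterday",
    "can you call me after class", "the meeting moved to friday morning",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split into train and test sets, keeping the spam/ham ratio in both
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorize the text and train a linear-kernel SVM in one pipeline
clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    SVC(kernel="linear"),
)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline means the vectorizer is fitted on the training split only, which avoids leaking test vocabulary into training.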


**Data and Model File Download:**
The dataset used in this project is the SMS Spam Collection dataset from Kaggle:
[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]


### Project Directory Structure
```
End to End Email Spam Classifier
|- Dataset
   |- spam.csv
   |- README.md
|- Images
   |- graphs
   |- interface
   |- model live recording
   |- model accuracy
   |- model precision
   |- worldcloud-1
   |- worldcloud-2
|- Model
   |- Email_Spam_Classifier2.ipynb
   |- README.md
   |- model.pkl
   |- vectorizer.pkl
|- Web App
   |- openapp.py
   |- README.md
|- requirements.txt
```
## 🙌 Acknowledgments

The authors would like to acknowledge the contributions of the research community in the field of email spam classification using machine learning. The open-source datasets and repositories have been instrumental in the development of this project.

Citations:
[1] [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
[2] [https://scikit-learn.org/stable/modules/naive_bayes.html]
[3] [https://scikit-learn.org/stable/modules/svm.html]
[4] [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html]
[5] [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html]
[6] [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html]


## How to Use
Requirements: Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the requirements.txt file.

Download Data: Download the spam.csv Dataset from Kaggle mentioned in the dataset section of the project.

Run the Jupyter Notebook: Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment.

Training and Evaluation: Train the models using the provided data and evaluate their performance using metrics such as accuracy and precision.

Interpret Results: Analyze the model's performance using the visualizations and metrics provided in the notebook.

Feel free to reach out if you encounter any issues or need further assistance with running the notebook.

## Connect with me
Piyush
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)


16 changes: 16 additions & 0 deletions End to End Email Spam Classifier/Web app/README.md
@@ -0,0 +1,16 @@
## End to End Email Spam Classifier

### Goal 🎯
The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### Model(s) used for the Web App 🧮
The model used here is a Support Vector Machine with an accuracy of 97%. The model is deployed with the Streamlit library.
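
As a rough sketch of how a pickled vectorizer and model can be restored and used for a prediction: the toy training data below is made up, and the in-memory `pickle.dumps`/`pickle.loads` round trip stands in for reading `vectorizer.pkl` and `model.pkl` from disk. The helper name `classify` is hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data made up for illustration; 1 = spam, 0 = ham.
messages = [
    "win a free prize now", "claim your free cash reward",
    "urgent winner call now", "free entry in a prize draw",
    "are we meeting for lunch", "please send me the report",
    "see you at home tonight", "thanks for the help",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Fit a vectorizer and a linear-kernel SVM, as a stand-in for the trained artifacts
tfidf = TfidfVectorizer()
model = SVC(kernel="linear").fit(tfidf.fit_transform(messages), labels)

# Serialize both objects, then restore them as the web app would at startup
restored_tfidf = pickle.loads(pickle.dumps(tfidf))
restored_model = pickle.loads(pickle.dumps(model))


def classify(text):
    """Vectorize one message with the restored vectorizer and label it."""
    vector = restored_tfidf.transform([text])
    return "Spam" if restored_model.predict(vector)[0] == 1 else "Not Spam"


print(classify("claim your free prize"))
```

The vectorizer has to be saved alongside the model because the classifier only understands the feature columns produced by that exact fitted vocabulary.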

### Video Demonstration 🎥

https://github.com/codewithpiyushh/ML-Crate/assets/154052068/856c2d97-585a-438b-8901-765fd90d7d0b

### Signature ✒️
Piyush

[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
87 changes: 87 additions & 0 deletions End to End Email Spam Classifier/Web app/openapp.py
@@ -0,0 +1,87 @@
import streamlit as st
import pickle
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import os

# Download required NLTK data only once
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Initialize the PorterStemmer
ps = PorterStemmer()

# Function to preprocess and transform the text
def transform_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stop words, then stem."""
    tokens = nltk.word_tokenize(text.lower())

    # Keep only alphanumeric tokens (drops punctuation and symbols)
    tokens = [token for token in tokens if token.isalnum()]

    # Remove English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Reduce each token to its stem with the module-level PorterStemmer
    stemmed_tokens = [ps.stem(token) for token in tokens]

    return " ".join(stemmed_tokens)

def predict_spam(vector_input, model, tfidf):
    """Return "Spam" or "Not Spam" for a single vectorized message."""
    dense_input = vector_input.toarray()  # Convert sparse matrix to dense array
    prediction = model.predict(dense_input)[0]
    return "Spam" if prediction == 1 else "Not Spam"

# Verify the presence of files before loading
vectorizer_path = '../Model/vectorizer.pkl'
model_path = '../Model/model.pkl'

if not os.path.exists(vectorizer_path):
    st.error(f"Error: {vectorizer_path} file not found.")
    st.stop()

if not os.path.exists(model_path):
    st.error(f"Error: {model_path} file not found.")
    st.stop()

# Load the vectorizer and model with error handling
try:
    with open(vectorizer_path, 'rb') as file:
        tfidf = pickle.load(file)
except Exception as e:
    st.error(f"Error loading {vectorizer_path}: {e}")
    st.stop()

try:
    with open(model_path, 'rb') as file:
        model = pickle.load(file)
except Exception as e:
    st.error(f"Error loading {model_path}: {e}")
    st.stop()

# Streamlit UI
st.title("Email Spam Classifier")
input_sms = st.text_input("Enter the message")

if st.button("Predict"):
    if input_sms:
        try:
            transformed_sms = transform_text(input_sms)
            vector_input = tfidf.transform([transformed_sms])
            # predict_spam returns the label string "Spam" or "Not Spam",
            # so display it directly instead of comparing it with 0
            result = predict_spam(vector_input, model, tfidf)
            st.header(result)
        except Exception as e:
            st.error(f"Error during prediction: {e}")
    else:
        st.error("Please enter a message to classify.")