Merge pull request #626 from codewithpiyushh/main
Created Email Spam Classifier
Showing 14 changed files with 8,443 additions and 0 deletions.
5,573 changes: 5,573 additions & 0 deletions
End to End Email Spam Classifier/Dataset/spam.csv
Large diffs are not rendered by default.
2,595 changes: 2,595 additions & 0 deletions
End to End Email Spam Classifier/Model/Email_spam_Classifier2.ipynb
Large diffs are not rendered by default.
@@ -0,0 +1,78 @@ | ||
## **End to End Email Spam Classifier**

### 🎯 **Goal**

The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### 🧵 **Dataset**

The dataset used in this project is the SMS Spam Collection Dataset from Kaggle:
[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
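As a rough illustration of reading the dataset, the sketch below parses two invented rows with Python's `csv` module. It assumes the Kaggle file has a label column named `v1` (`ham` or `spam`) and a text column named `v2`; `load_messages` and `SAMPLE_CSV` are hypothetical names, not part of this project.

```python
import csv
import io

# Two made-up rows in the assumed layout of spam.csv:
# "v1" holds the label, "v2" holds the message text (assumed column names).
SAMPLE_CSV = 'v1,v2\nham,"Ok, see you at 5"\nspam,"WINNER! Claim your free prize now"\n'


def load_messages(csv_file):
    """Return (texts, labels) with labels encoded as 0 = ham, 1 = spam."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row["v2"])
        labels.append(1 if row["v1"].strip().lower() == "spam" else 0)
    return texts, labels


texts, labels = load_messages(io.StringIO(SAMPLE_CSV))
print(labels)  # [0, 1]
```

For the real file, an open file handle can be passed instead, for example `open("spam.csv", encoding="latin-1", newline="")`. The `latin-1` encoding is an assumption about the Kaggle export and may need adjusting.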

### 🧮 **What I have done!**

In this email spam classification project, I evaluated several models and found that the Support Vector Machine (SVM) achieved the highest accuracy. SVMs are powerful classifiers that separate data into distinct classes by finding an optimal hyperplane. They handle both linear and non-linear data through kernel functions, and by maximizing the margin between classes they reduce misclassifications. Consequently, my results point to the SVM as a robust choice for spam detection, contributing to better email filtering and an enhanced user experience. 📧🔍🚀
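To make the idea concrete, here is a minimal sketch of a linear-kernel SVM on TF-IDF features using scikit-learn. The toy messages and labels are invented for illustration, and the choice of `TfidfVectorizer` with default settings is an assumption rather than the notebook's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy messages, invented for illustration (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now",
    "free cash offer claim now",
    "urgent claim your free reward",
    "are we meeting for lunch today",
    "see you at the office tomorrow",
    "can you send me the report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn each message into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# A linear kernel looks for the maximum-margin hyperplane in TF-IDF space.
svm = SVC(kernel="linear")
svm.fit(X_train, train_labels)

# New messages must go through the same fitted vectorizer.
new_texts = ["claim your free prize", "lunch at the office today"]
predictions = svm.predict(vectorizer.transform(new_texts))
print(predictions)
```

Swapping `kernel="linear"` for `kernel="rbf"` would let the same code fit a non-linear decision boundary.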

### 🧾 **Description**

Email spam classification refers to the process of automatically categorizing incoming emails as either spam (unsolicited, irrelevant, or potentially harmful) or legitimate (also known as "ham"). The goal is to enhance user experience by organizing and prioritizing emails, reducing annoyance, and protecting users from malicious content. Various machine learning models, including Support Vector Machines (SVMs), are commonly used for accurate spam detection. 📧🔍🚀

### 🚀 **Models Implemented**

I trained several models: Logistic Regression, K-Neighbors Classifier, Naive Bayes, Decision Tree Classifier, and Support Vector Machine. Of these, the Support Vector Machine gave the highest accuracy.
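A comparison like this can be sketched as a loop over scikit-learn estimators. The data below is a toy set invented for illustration, and the specific estimator choices (for example `MultinomialNB` as the Naive Bayes variant) and hyperparameters are assumptions, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now", "free cash offer claim now",
    "urgent claim your free reward", "cheap loans apply now free",
    "are we meeting for lunch today", "see you at the office tomorrow",
    "can you send me the report", "thanks for the lunch today",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "claim your free cash prize", "free reward offer now",
    "send the report tomorrow", "meeting at the office today",
]
y_test = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Candidate models; the hyperparameters here are placeholders.
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Neighbors Classifier": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(kernel="linear"),
}

# Fit every model on the same split and record (accuracy, precision).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, zero_division=0),
    )

for name, (acc, prec) in results.items():
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}")
```

Evaluating every model on the same train/test split keeps the scores comparable.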

### 📚 **Libraries Needed**

- scikit-learn: Machine learning library for model training and evaluation.
- pandas: Data manipulation library for preprocessing and data handling.
- numpy: Numerical computing library for mathematical operations.
- matplotlib, seaborn, plotly: Visualization libraries for data exploration and analysis.
- streamlit: Web app library for deploying the model.

### 📊 **Exploratory Data Analysis Results**

![graphs](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/08ba4a85-6ed3-409b-bc43-52362d2ffd17)

### 📈 **Performance of the Models based on the Accuracy Scores**

- *Logistic Regression*:
  - Best Parameters: {'C': 10, 'penalty': 'l2'}
  - Accuracy: 0.973
  - Precision: 0.973

- *K-Neighbors Classifier*:
  - Best Parameters: {'n_neighbors': 3}
  - Accuracy: 0.999
  - Precision: 0.906

- *Naive Bayes*:
  - Best Parameters: {'min_samples_split': 5, 'max_depth': 20}
  - Accuracy: 0.977
  - Precision: 0.977

- *Decision Tree Classifier*:
  - Best Parameters: {'max_depth': 7}
  - Accuracy: 0.937
  - Precision: 0.933

- *Support Vector Machine*:
  - Best Parameters: {'gamma': 0.1, 'kernel': 'linear'}
  - Accuracy: 0.978
  - Precision: 0.978
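For reference, the two metrics reported above can be computed as in the following sketch. The label lists are invented for illustration and `accuracy_and_precision` is a hypothetical helper. Accuracy is the share of all messages classified correctly, while precision is the share of messages flagged as spam that really are spam.

```python
def accuracy_and_precision(y_true, y_pred):
    """Compute accuracy and spam-class precision (1 = spam, 0 = ham)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    flagged = true_pos + false_pos
    # With no messages flagged as spam, precision is undefined; report 0.0.
    precision = true_pos / flagged if flagged else 0.0
    return accuracy, precision


# Invented labels: 2 spam and 8 ham messages.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_and_precision(y_true, y_pred))  # (0.8, 0.5)
```

Because ham usually outnumbers spam, a model can reach high accuracy while still flagging legitimate messages, which is why precision is reported alongside accuracy.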

![model-accuracy](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/a283987a-523d-4ee2-8484-1ff31a41eb83)

![wordcloud-1](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/2b6fbade-72ea-442a-afe7-000b93975e92)

![wordcloud-2](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/0972dc14-6d1c-4d42-a0ef-d3ae9dde064f)

### 📢 **Conclusion**

In conclusion, email spam classification significantly improves user experience by organizing emails, detecting and filtering out spam, enhancing security, and ensuring legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### ✒️ **Your Signature**

Piyush
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,94 @@ | ||
# 🧠 End to End Email Spam Classifier

![interface](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/db86a574-6e9c-47ee-bdbf-5966b8c81773)

## 📝 Abstract

Email spam classification involves automatically identifying and sorting emails into two categories: spam (unwanted messages) and ham (legitimate messages). Techniques such as analyzing email content and user behavior, together with machine learning models, help achieve this. The goal is to keep spam out of your inbox and ensure you receive important emails.

## 🔍 Methodology

1. **Importing Libraries:**
   - Libraries such as NumPy, Pandas, scikit-learn, and others are imported for data manipulation, visualization, and machine learning model building.

2. **Loading the Dataset:**
   - The dataset, which contains rows of messages labeled as spam or not spam, is loaded.

3. **Data Preprocessing:**
   - Prepare data for analysis: handle missing values, encode categorical data, scale features, perform feature engineering, split into train-test sets, and normalize data. Ensure data is in a suitable format for machine learning algorithms.

4. **Training the Models:**
   - Each model, including the Support Vector Machine, is instantiated.
   - The models are trained on the training dataset and then evaluated.

5. **Model Performance Analysis:**
   - Accuracy and precision scores are plotted to compare the models' performance.

6. **Model prediction:**
   - The model is then given a test dataset to check its accuracy and precision.

7. **Deploy:**
   - The model is deployed using the Streamlit library.
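The steps above can be sketched end to end as follows. This is a minimal illustration on invented toy data, assuming a TF-IDF vectorizer and a linear-kernel SVM. It pickles the vectorizer and the model as separate objects, mirroring the `vectorizer.pkl` and `model.pkl` files, but serializes them in memory rather than to disk.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data, invented for illustration (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free cash offer claim now",
    "urgent claim your free reward", "cheap loans apply now free",
    "you won a free holiday claim today", "free entry win cash now",
    "are we meeting for lunch today", "see you at the office tomorrow",
    "can you send me the report", "thanks for the lunch today",
    "call me when you get home", "the meeting moved to tomorrow",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split into train and test sets, keeping the class balance.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vectorizer on the training text only, then train the model.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train_txt)
X_test = tfidf.transform(X_test_txt)

model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Evaluate on the held-out messages.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))

# Serialize both objects, as the web app needs the vectorizer and the model.
vectorizer_bytes = pickle.dumps(tfidf)
model_bytes = pickle.dumps(model)

# Restoring them should reproduce the same predictions.
restored_tfidf = pickle.loads(vectorizer_bytes)
restored_model = pickle.loads(model_bytes)
restored_pred = restored_model.predict(restored_tfidf.transform(X_test_txt))
```

Writing the bytes to files with `open(path, "wb")` would produce artifacts that a Streamlit app can load at startup.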

**Data and Model File Download:**
The dataset used in this project is the SMS Spam Collection Dataset from Kaggle:
[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]

### Project Directory Structure
```
End to End Email Spam Classifier
|- Dataset
   |- spam.csv
   |- README.md
|- Images
   |- graphs
   |- interface
   |- model live recording
   |- model accuracy
   |- model precision
   |- worldcloud-1
   |- worldcloud-2
|- Model
   |- Email_Spam_Classifier2.ipynb
   |- README.md
   |- model.pkl
   |- vectorizer.pkl
|- Web App
   |- openapp.py
   |- README.md
|- requirements.txt
```
## 🙌 Acknowledgments

The author would like to acknowledge the contributions of the research community in the field of email spam classification using machine learning. The open-source datasets and repositories have been instrumental in the development of this project.

Citations:
[1] [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
[2] [https://scikit-learn.org/stable/modules/naive_bayes.html]
[3] [https://scikit-learn.org/stable/modules/svm.html]
[4] [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html]
[5] [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html]
[6] [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html]

## How to Use
Requirements: Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the requirements.txt file.

Download Data: Download the spam.csv dataset from the Kaggle link given in the dataset section of the project.

Run the Jupyter Notebook: Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment.

Training and Evaluation: Train the models using the provided data and evaluate their performance using metrics such as accuracy and precision.

Interpret Results: Analyze the model's performance using the visualizations and metrics provided in the notebook.

Feel free to reach out if you encounter any issues or need further assistance with running the notebook.

## Connect with me
Piyush
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
@@ -0,0 +1,16 @@ | ||
## End to End Email Spam Classifier

### Goal 🎯
The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍

### Model(s) used for the Web App 🧮
The web app uses a Support Vector Machine with an accuracy of 97%, and the model is deployed with the Streamlit library.

### Video Demonstration 🎥

https://github.com/codewithpiyushh/ML-Crate/assets/154052068/856c2d97-585a-438b-8901-765fd90d7d0b

### Signature ✒️
Piyush

[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
@@ -0,0 +1,87 @@ | ||
import os
import pickle

import nltk
import streamlit as st
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Download the required NLTK data (skipped quietly if already present)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Build the stemmer and the stop-word set once instead of on every call
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))


def transform_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stop words, then stem."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalnum()]
    tokens = [token for token in tokens if token not in stop_words]
    return " ".join(ps.stem(token) for token in tokens)


def predict_spam(vector_input, model):
    """Return "Spam" or "Not Spam" for a single vectorized message."""
    dense_input = vector_input.toarray()  # Convert sparse matrix to dense array
    prediction = model.predict(dense_input)[0]
    return "Spam" if prediction == 1 else "Not Spam"


# Verify the presence of files before loading
vectorizer_path = '../Model/vectorizer.pkl'
model_path = '../Model/model.pkl'

for path in (vectorizer_path, model_path):
    if not os.path.exists(path):
        st.error(f"Error: {path} file not found.")
        st.stop()

# Load the vectorizer and model with error handling
try:
    with open(vectorizer_path, 'rb') as file:
        tfidf = pickle.load(file)
except Exception as e:
    st.error(f"Error loading {vectorizer_path}: {e}")
    st.stop()

try:
    with open(model_path, 'rb') as file:
        model = pickle.load(file)
except Exception as e:
    st.error(f"Error loading {model_path}: {e}")
    st.stop()

# Streamlit UI
st.title("Email Spam Classifier")
input_sms = st.text_input("Enter the message")

if st.button("Predict"):
    if input_sms:
        try:
            transformed_sms = transform_text(input_sms)
            vector_input = tfidf.transform([transformed_sms])
            # predict_spam returns a label string, so display it directly
            # (comparing it to 0 would always fall through to "Spam")
            result = predict_spam(vector_input, model)
            st.header(result)
        except Exception as e:
            st.error(f"Error during prediction: {e}")
    else:
        st.error("Please enter a message to classify.")