Merge pull request #626 from codewithpiyushh/main

Created Email Spam Classifier
abhisheks008 · Jun 6, 2024 · 0c3df45
2 parents a5a6bff + 5f734de
commit 0c3df45
Showing 14 changed files with 8,443 additions and 0 deletions.
diff --git a/End to End Email Spam Classifier/Dataset/spam.csv b/End to End Email Spam Classifier/Dataset/spam.csv
diff --git a/End to End Email Spam Classifier/Model/Email_spam_Classifier2.ipynb b/End to End Email Spam Classifier/Model/Email_spam_Classifier2.ipynb
diff --git a/End to End Email Spam Classifier/Model/README.md b/End to End Email Spam Classifier/Model/README.md
@@ -0,0 +1,78 @@
+## **End to End Email Spam Classifier**
+
+### 🎯 **Goal**
+
+The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍
+
+### 🧵 **Dataset**
+
+The dataset used in this project is the SMS Spam Collection Dataset from Kaggle:
+[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
+
+### 🧮 **What I have done!**
+
+In this email spam classification project, I evaluated several models and found that the Support Vector Machine (SVM) achieved the highest accuracy. SVMs are powerful classifiers that separate data into distinct classes by finding an optimal hyperplane. They handle both linear and non-linear data through kernel functions, and by maximizing the margin between classes they reduce misclassifications. Consequently, my results highlight the SVM as a robust choice for spam detection, contributing to better email filtering and an enhanced user experience. 📧🔍🚀
+
+### 🧾 **Description**
+
+Email spam classification refers to the process of automatically categorizing incoming emails as either spam (unsolicited, irrelevant, or potentially harmful) or legitimate (also known as “ham”). The goal is to enhance user experience by organizing and prioritizing emails, reducing annoyance, and protecting users from malicious content. Various machine learning models, including Support Vector Machines (SVMs), are commonly used for accurate spam detection. 📧🔍🚀
+
+### 🚀 **Models Implemented**
+
+I trained several models: Logistic Regression, K-Neighbors Classifier, Naive Bayes, Decision Tree Classifier, and Support Vector Machine. Of these, the Support Vector Machine gave the highest accuracy.
+
+### 📚 **Libraries Needed**
+
+- scikit-learn: Machine learning library for model training and evaluation.
+- pandas: Data manipulation library for preprocessing and data handling.
+- numpy: Numerical computing library for mathematical operations.
+- matplotlib, seaborn, plotly: Visualization libraries for data exploration and analysis.
+- streamlit: Web app library for deploying the model.
+
+### 📊 **Exploratory Data Analysis Results**
+
+![graphs](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/08ba4a85-6ed3-409b-bc43-52362d2ffd17)
+
+
+### 📈 **Performance of the Models based on the Accuracy Scores**
+
+- *Logistic Regression*:
+  - Best Parameters: {'C': 10, 'penalty': 'l2'}
+  - Accuracy: 0.973
+  - Precision: 0.973
+
+- *K-Neighbors Classifier*:
+  - Best Parameters: {'n_neighbors': 3}
+  - Accuracy: 0.999
+  - Precision: 0.906
+
+- *Naive Bayes*:
+  - Best Parameters: {'min_samples_split': 5, 'max_depth': 20}
+  - Accuracy: 0.977
+  - Precision: 0.977
+
+- *Decision Tree Classifier*:
+  - Best Parameters: {'max_depth': 7}
+  - Accuracy: 0.937
+  - Precision: 0.933
+
+- *Support Vector Machine*:
+  - Best Parameters: {'gamma': 0.1, 'kernel': 'linear'}  
+  - Accuracy: 0.978
+  - Precision: 0.978
+
+![model-accuracy](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/a283987a-523d-4ee2-8484-1ff31a41eb83)
+
+![wordcloud-1](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/2b6fbade-72ea-442a-afe7-000b93975e92)
+
+![wordcloud-2](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/0972dc14-6d1c-4d42-a0ef-d3ae9dde064f)
+
+
+### 📢 **Conclusion**
+
+In conclusion, email spam classification significantly improves user experience by organizing emails, detecting and filtering out spam, enhancing security, and ensuring legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍
+
+### ✒️ **Your Signature**
+
+Piyush 
+[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
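
Note on the metrics listed in Model/README.md: the README reports an accuracy and a precision score for each model. The sketch below shows, in plain Python, how the two metrics differ when computed from true and predicted labels. The function name `accuracy_and_precision` and the sample labels are hypothetical and are not taken from the notebook, which may compute its scores differently (for example with scikit-learn's metric helpers).

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels; `positive` marks the spam class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    pred_pos = sum(p == positive for p in y_pred)

    accuracy = correct / len(y_true)
    # Precision is undefined when nothing is predicted as spam; report 0.0 in that case.
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

Precision counts how many messages flagged as spam really are spam, so for a spam filter it tracks how often a legitimate message would be wrongly filtered out. That is a reason to read it alongside accuracy when comparing the models.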
diff --git a/End to End Email Spam Classifier/Model/model.pkl b/End to End Email Spam Classifier/Model/model.pkl
diff --git a/End to End Email Spam Classifier/Model/vectorizer.pkl b/End to End Email Spam Classifier/Model/vectorizer.pkl
diff --git a/End to End Email Spam Classifier/README.md b/End to End Email Spam Classifier/README.md
@@ -0,0 +1,94 @@
+# 🧠 End to End Email Spam Classifier
+
+![interface](https://github.com/codewithpiyushh/ML-Crate/assets/154052068/db86a574-6e9c-47ee-bdbf-5966b8c81773)
+
+## 📝 Abstract
+
+Email spam classification involves automatically identifying and sorting emails into two categories: spam (unwanted messages) and ham (legitimate messages). Techniques like analyzing email content, user behavior, and machine learning models help achieve this. The goal is to keep spam out of your inbox and ensure you receive important emails.
+
+
+## 🔍 Methodology
+
+1. **Importing Libraries:**
+   - Libraries such as NumPy, Pandas, scikit-learn, and others are imported for data manipulation, visualization, and machine learning model building.
+
+2. **Loading the Dataset:**
+   - The dataset is loaded; it contains multiple rows of messages, each labeled as spam or not spam.
+
+3. **Data Preprocessing:**
+   - The data is prepared for analysis: handle missing values, encode categorical data, scale features, perform feature engineering, split into train-test sets, and normalize data, so that the data is in a suitable format for machine learning algorithms.
+
+4. **Training the Models:**
+   - The Support Vector Machine and the other candidate models are fitted on the training dataset.
+   - Each trained model is then evaluated.
+
+5. **Model Performance Analysis:**
+   - The accuracy and precision of the models are plotted to visualize their performance.
+
+6. **Model prediction:**
+   - The model is then given a test dataset to check its accuracy and precision.
+
+7. **Deploy:**
+   - The model is deployed with the Streamlit library.
+
+
+**Data and Model File Download:**
+The dataset used in this project is the SMS Spam Collection Dataset from Kaggle:
+[https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
+
+
+### Project Directory Structure
+```
+End to End Email Spam Classifier
+|- Dataset
+  |- spam.csv
+  |- README.md
+|- images
+  |- graphs
+  |- interface
+  |- model live recording
+  |- model accuracy
+  |- model precision
+  |- wordcloud-1
+  |- wordcloud-2
+|- Model
+  |- Email_spam_Classifier2.ipynb
+  |- README.md
+  |- model.pkl
+  |- vectorizer.pkl
+|- Web app
+  |- openapp.py
+  |- README.md
+|- requirements.txt
+```
+## 🙌 Acknowledgments
+
+The authors would like to acknowledge the contributions of the research community in the field of email spam classification using machine learning. The open-source datasets and repositories have been instrumental in the development of this project.
+
+Citations:
+[1] [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset]
+[2] [https://scikit-learn.org/stable/modules/naive_bayes.html]
+[3] [https://scikit-learn.org/stable/modules/svm.html]
+[4] [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html]
+[5] [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html]
+[6] [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html]
+
+
+## How to Use
+Requirements: Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the requirements.txt file.
+
+Download Data: Download the spam.csv dataset from the Kaggle link given in the dataset section of the project.
+
+Run the Jupyter Notebook: Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment.
+
+Training and Evaluation: Train the models using the provided data and evaluate their performance using metrics such as accuracy and precision.
+
+Interpret Results: Analyze the model's performance using the visualizations and metrics provided in the notebook.
+
+Feel free to reach out if you encounter any issues or need further assistance with running the notebook.
+
+## Connect with me 
+Piyush
+[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
+
+
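
Note on methodology steps 2 and 3 in the README above: a minimal sketch, using only the Python standard library, of loading labeled messages, encoding the labels, and splitting off a test set. The column names `v1` (label) and `v2` (text), the inline sample rows, and the helper names `load_messages` and `train_test_split` are assumptions for illustration. The notebook itself is not shown in this diff and most likely uses pandas and scikit-learn for these steps.

```python
import csv
import io
import random

# Hypothetical rows in the assumed shape of spam.csv (v1 = label, v2 = message text).
SAMPLE_CSV = """v1,v2
ham,"Ok lar, see you at lunch"
spam,WINNER!! Claim your free prize now
ham,Are we still meeting tomorrow?
spam,Free entry in a weekly competition
ham,I'll call you later
"""


def load_messages(handle):
    """Read (text, label) pairs, encoding ham as 0 and spam as 1."""
    label_map = {"ham": 0, "spam": 1}
    return [(row["v2"], label_map[row["v1"]]) for row in csv.DictReader(handle)]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows with a fixed seed and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = load_messages(io.StringIO(SAMPLE_CSV))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible, so accuracy and precision figures can be compared across runs and across models.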
diff --git a/End to End Email Spam Classifier/Web app/README.md b/End to End Email Spam Classifier/Web app/README.md
@@ -0,0 +1,16 @@
+## End to End Email Spam Classifier 
+
+### Goal 🎯
+The goal of email spam classification is to enhance user experience by organizing and prioritizing emails, detecting and filtering out spam, improving security, and ensuring that legitimate messages receive attention. It streamlines communication, reduces annoyance, and protects users from malicious content. 📧🔍
+
+### Model(s) used for the Web App 🧮
+The model I used here is a Support Vector Machine with an accuracy of 97%. The model is deployed with the Streamlit library.
+
+### Video Demonstration 🎥
+
+https://github.com/codewithpiyushh/ML-Crate/assets/154052068/856c2d97-585a-438b-8901-765fd90d7d0b
+
+### Signature ✒️
+Piyush 
+
+[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/piyushhh-singhh/)
diff --git a/End to End Email Spam Classifier/Web app/openapp.py b/End to End Email Spam Classifier/Web app/openapp.py
@@ -0,0 +1,87 @@
+import streamlit as st
+import pickle
+import nltk
+import string
+from nltk.corpus import stopwords
+from nltk.stem.porter import PorterStemmer
+import os
+
+# Download required NLTK data only once
+nltk.download('punkt', quiet=True)
+nltk.download('stopwords', quiet=True)
+
+# Initialize the PorterStemmer
+ps = PorterStemmer()
+
+# Build the stop-word set once instead of on every call
+STOP_WORDS = set(stopwords.words('english'))
+
+# Function to preprocess and transform the text
+def transform_text(text):
+    # Lowercase the text and split it into tokens
+    tokens = nltk.word_tokenize(text.lower())
+
+    # Keep alphanumeric tokens that are not stop words
+    tokens = [token for token in tokens if token.isalnum() and token not in STOP_WORDS]
+
+    # Stem each token with the module-level PorterStemmer and rejoin into one string
+    return " ".join(ps.stem(token) for token in tokens)
+
+def predict_spam(vector_input, model, tfidf):
+    dense_input = vector_input.toarray()  # Convert sparse matrix to dense array
+    prediction = model.predict(dense_input)[0]
+    return "Spam" if prediction == 1 else "Not Spam"
+
+# Verify the presence of files before loading
+vectorizer_path = '../Model/vectorizer.pkl'
+model_path = '../Model/model.pkl'
+
+if not os.path.exists(vectorizer_path):
+    st.error(f"Error: {vectorizer_path} file not found.")
+    st.stop()
+
+if not os.path.exists(model_path):
+    st.error(f"Error: {model_path} file not found.")
+    st.stop()
+
+# Load the vectorizer and model with error handling
+try:
+    with open(vectorizer_path, 'rb') as file:
+        tfidf = pickle.load(file)
+except Exception as e:
+    st.error(f"Error loading {vectorizer_path}: {e}")
+    st.stop()
+
+try:
+    with open(model_path, 'rb') as file:
+        model = pickle.load(file)
+except Exception as e:
+    st.error(f"Error loading {model_path}: {e}")
+    st.stop()
+
+# Streamlit UI
+st.title("Email Spam Classifier")
+input_sms = st.text_input("Enter the message")
+
+if st.button("Predict"):
+    if input_sms:
+        try:
+            transformed_sms = transform_text(input_sms)
+            vector_input = tfidf.transform([transformed_sms])
+            result = predict_spam(vector_input, model, tfidf)
+
+            # predict_spam returns the label string ("Spam" or "Not Spam"), so show it directly
+            st.header(result)
+        except Exception as e:
+            st.error(f"Error during prediction: {e}")
+    else:
+        st.error("Please enter a message to classify.")
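
Note on the prediction flow in openapp.py: the app preprocesses the message, runs the model, and maps the numeric class to a label. One way to keep that mapping easy to check is to hold it in a small function that does not depend on Streamlit, as sketched below. `StubModel`, `LABELS`, and `classify` are hypothetical names for illustration; the stub stands in for the pickled classifier and is not the project's model.

```python
# Hypothetical stand-in for the pickled classifier, used only for illustration.
class StubModel:
    def predict(self, rows):
        # Flag any message containing the token "free" as spam (1), otherwise ham (0).
        return [1 if "free" in row.split() else 0 for row in rows]


# Numeric class to display label, following the app's convention that 1 means spam.
LABELS = {0: "Not Spam", 1: "Spam"}


def classify(message, model, preprocess=str.lower):
    """Preprocess one message, run the model, and map the numeric class to a label."""
    prediction = model.predict([preprocess(message)])[0]
    return LABELS[int(prediction)]


print(classify("Claim your FREE prize", StubModel()))
print(classify("See you at lunch", StubModel()))
```

With the mapping in one place, the UI code only has to display the returned string, and the function can be exercised in a unit test without starting the web app.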
diff --git a/End to End Email Spam Classifier/images/graphs.png b/End to End Email Spam Classifier/images/graphs.png
diff --git a/End to End Email Spam Classifier/images/interface.png b/End to End Email Spam Classifier/images/interface.png
diff --git a/End to End Email Spam Classifier/images/model-accuracy.png b/End to End Email Spam Classifier/images/model-accuracy.png
diff --git a/End to End Email Spam Classifier/images/model-precision.png b/End to End Email Spam Classifier/images/model-precision.png
diff --git a/End to End Email Spam Classifier/images/wordcloud-1.png b/End to End Email Spam Classifier/images/wordcloud-1.png
diff --git a/End to End Email Spam Classifier/images/wordcloud-2.png b/End to End Email Spam Classifier/images/wordcloud-2.png