abhisheks008 · Dec 10, 2023
diff --git a/Email Spam Classification/Datasets/Readme.md b/Email Spam Classification/Datasets/Readme.md
@@ -0,0 +1,2 @@
+https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data
+*Dataset from Kaggle*
diff --git a/Email Spam Classification/Datasets/archive (3).zip b/Email Spam Classification/Datasets/archive (3).zip
diff --git a/Email Spam Classification/Images/Readme.md b/Email Spam Classification/Images/Readme.md
@@ -0,0 +1,9 @@
+This folder contains image files (PNG screenshots) of the EDA results.
+
+ If there is a very large difference between the number of positive and negative examples, the dataset is imbalanced, as is the case here.
+
+ The word clouds show the most frequent word stems in spam and ham messages.
+
+ On average, ham messages are shorter than spam messages.
+
+
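The three observations in the Images README can be reproduced with a few lines of code. The following is a minimal sketch using only the standard library; the `messages` list is a hypothetical toy sample (not rows from the Kaggle dataset), and the helper names `class_counts`, `imbalance_ratio`, and `mean_length_by_label` are illustrative rather than taken from the notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy sample; the real data is the Kaggle SMS Spam Collection.
messages = [
    ("ham", "Are we still on for lunch?"),
    ("ham", "Call me later"),
    ("ham", "Ok, see you at 5"),
    ("ham", "Running late, sorry"),
    ("spam", "WINNER! You have been selected for a free prize, call now to claim"),
]

def class_counts(rows):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in rows)

def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())

def mean_length_by_label(rows):
    """Average message length in characters for each label."""
    lengths = {}
    for label, text in rows:
        lengths.setdefault(label, []).append(len(text))
    return {label: mean(values) for label, values in lengths.items()}

counts = class_counts(messages)
print(counts)
print(imbalance_ratio(counts))
print(mean_length_by_label(messages))
```

A ratio well above 1.0 signals class imbalance, which is worth keeping in mind when reading accuracy figures for the trained model.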
diff --git a/Email Spam Classification/Images/Screenshot (237).png b/Email Spam Classification/Images/Screenshot (237).png
diff --git a/Email Spam Classification/Images/Screenshot (238).png b/Email Spam Classification/Images/Screenshot (238).png
diff --git a/Email Spam Classification/Images/Screenshot (239).png b/Email Spam Classification/Images/Screenshot (239).png
diff --git a/Email Spam Classification/Images/Screenshot (240).png b/Email Spam Classification/Images/Screenshot (240).png
diff --git a/Email Spam Classification/Models/Readme.md b/Email Spam Classification/Models/Readme.md
@@ -0,0 +1,29 @@
+# Spam Email Classification https://github.com/World-of-ML/DL-Simplified/issues/340
+
+Full name : Aindree Chatterjee
+
+GitHub Profile Link : https://github.com/aindree-2005
+
+Email ID : [email protected]
+
+Program : CodePeak
+
+Approach for this Project :
+
+
+**LSTM (Long Short-Term Memory):**
+
+**Type:** LSTM is a type of recurrent neural network (RNN).
+**Usage:** It is used for sequential data and is particularly effective in tasks where context or order of the data is important, such as time series prediction or natural language processing.
+**Strengths:** LSTMs are designed to capture long-term dependencies and can be effective in handling sequences of variable length.
+**Limitations:** LSTMs may struggle with capturing very long-term dependencies, and they can be computationally expensive.
+
+**BERT (Bidirectional Encoder Representations from Transformers):**
+
+**Type:** BERT is based on the transformer architecture.
+**Usage:** BERT is specifically designed for natural language understanding tasks, such as question answering, sentiment analysis, and text classification. It has been pre-trained on large amounts of text data and can be fine-tuned for specific tasks.
+**Strengths:** BERT models excel in capturing contextual information and have achieved state-of-the-art results in a wide range of NLP tasks. They can understand the meaning of words in context and handle bidirectional dependencies well.
+**Limitations:** BERT models are computationally expensive, and they require a large amount of pre-training data. Fine-tuning on specific tasks is necessary for optimal performance.
+
+**My Conclusion**
+LSTM is preferred over BERT for spam email detection here because it can capture sequential patterns in text. Spam emails often exhibit specific linguistic structures and patterns, and LSTMs excel at modeling sequential dependencies. BERT, designed for deep contextual understanding, might be overkill for a simpler spam detection task such as this one, where recurring overall patterns matter more than fine-grained context.
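Because the README stresses that LSTMs consume ordered sequences of variable length, it may help to see how raw text is typically turned into fixed-length integer sequences before it reaches such a model. This is a minimal standard-library sketch, not the notebook's actual preprocessing: `build_vocab`, `encode`, `pad`, and the reserved ids `PAD_ID` and `OOV_ID` are hypothetical names, and whitespace tokenization is a simplifying assumption. The left-padding mirrors the default behavior of Keras' `pad_sequences`.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding positions
OOV_ID = 1  # reserved for words not seen when the vocabulary was built

def build_vocab(texts, max_words=1000):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}

def encode(text, vocab):
    """Turn a message into a list of word ids, preserving word order."""
    return [vocab.get(word, OOV_ID) for word in text.lower().split()]

def pad(seq, max_len):
    """Left-pad with PAD_ID, or keep the last max_len ids (assumes max_len >= 1)."""
    seq = seq[-max_len:]
    return [PAD_ID] * (max_len - len(seq)) + seq

texts = ["free prize call now", "call me later"]
vocab = build_vocab(texts)
batch = [pad(encode(text, vocab), max_len=5) for text in texts]
print(batch)
```

Every row of `batch` has the same length, so the rows can be stacked into one array and fed to an embedding layer followed by an LSTM layer.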
diff --git a/Email Spam Classification/Models/email-spam-classification (1).ipynb b/Email Spam Classification/Models/email-spam-classification (1).ipynb
diff --git a/Email Spam Classification/requirements.txt b/Email Spam Classification/requirements.txt
@@ -0,0 +1,9 @@
+nltk
+spacy
+keras
+tensorflow
+pandas
+scikit-learn
+matplotlib
+
+# These are all the required libraries. Steps to install and import them are given in the notebook.
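Before running the notebook, a quick check can confirm that the listed libraries are importable. The sketch below uses only `importlib` from the standard library; the helper `missing_packages` and the `IMPORT_NAMES` mapping are illustrative names, not part of the project. The one mapping entry reflects that the scikit-learn package is imported as `sklearn`.

```python
from importlib.util import find_spec

# pip package name -> import name, for packages where the two differ
IMPORT_NAMES = {"scikit-learn": "sklearn"}

def missing_packages(packages):
    """Return the packages whose top-level module cannot be found."""
    return [
        name for name in packages
        if find_spec(IMPORT_NAMES.get(name, name)) is None
    ]

required = ["nltk", "spacy", "keras", "tensorflow", "pandas", "scikit-learn", "matplotlib"]
print(missing_packages(required))
```

An empty list means every library was found; any names printed still need to be installed.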