Spam email detection #379

Merged
merged 13 commits into from
Dec 10, 2023
2 changes: 2 additions & 0 deletions Email Spam Classification/Datasets/Readme.md
@@ -0,0 +1,2 @@
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data
*Dataset from Kaggle*
Binary file not shown.
9 changes: 9 additions & 0 deletions Email Spam Classification/Images/Readme.md
@@ -0,0 +1,9 @@
This folder contains JPEG files of the EDA results.

If there is a very large difference between the number of positive and negative samples, the dataset is imbalanced, as is the case here.
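
As a rough illustration of the imbalance check described above, the class counts and the majority-to-minority ratio could be computed as in the sketch below. The `imbalance_ratio` helper and the toy labels are hypothetical and are not taken from the notebook.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = majority count / minority count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Toy labels for illustration only; the real labels come from the dataset.
toy_labels = ["ham"] * 8 + ["spam"] * 2
counts, ratio = imbalance_ratio(toy_labels)
print(dict(counts), ratio)
```

A ratio close to 1 suggests balanced classes; the further it is above 1, the more the majority class dominates.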

The word clouds show the most frequent word stems in spam and ham messages.

On average, ham messages are shorter than spam messages.
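
A minimal sketch of how the average message length per class might be compared is shown below. The `mean_length_by_label` helper and the example messages are made up for illustration; the notebook may measure length differently (for example, in words rather than characters).

```python
from statistics import mean


def mean_length_by_label(messages):
    """messages: iterable of (label, text) pairs -> {label: mean character length}."""
    lengths = {}
    for label, text in messages:
        lengths.setdefault(label, []).append(len(text))
    return {label: mean(values) for label, values in lengths.items()}


# Made-up examples, not rows from the dataset.
toy_messages = [
    ("ham", "See you at 5"),
    ("ham", "Ok"),
    ("spam", "WINNER! Claim your free prize now by calling this number"),
]
print(mean_length_by_label(toy_messages))
```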


29 changes: 29 additions & 0 deletions Email Spam Classification/Models/Readme.md
@@ -0,0 +1,29 @@
# Spam Email Classification https://github.com/World-of-ML/DL-Simplified/issues/340

Full name : Aindree Chatterjee

GitHub Profile Link : https://github.com/aindree-2005

Email ID : [email protected]

Program : CodePeak

Approach for this Project :


**LSTM (Long Short-Term Memory):**

**Type:** LSTM is a type of recurrent neural network (RNN).
**Usage:** It is used for sequential data and is particularly effective in tasks where context or order of the data is important, such as time series prediction or natural language processing.
**Strengths:** LSTMs are designed to capture long-term dependencies and can be effective in handling sequences of variable length.
**Limitations:** LSTMs may struggle with capturing very long-term dependencies, and they can be computationally expensive.
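
Before text reaches an LSTM, messages are commonly turned into integer sequences and padded to a common length so they can be batched. The sketch below shows that step using only the standard library. The helper names, the whitespace tokenizer, and the choice of right-padding are assumptions for illustration, not the notebook's actual preprocessing.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            # Assign the next free id only if the token is not yet in the vocabulary.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(text, vocab, max_len):
    """Convert text to token ids, truncate to max_len, and right-pad with the pad id."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Toy corpus for illustration only.
vocab = build_vocab(["free prize now", "see you now"])
print(encode_and_pad("free money", vocab, 4))
```

Tokens not seen when the vocabulary was built map to the unknown id, and every encoded message has exactly `max_len` entries.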

**BERT (Bidirectional Encoder Representations from Transformers):**

**Type:** BERT is based on the transformer architecture.
**Usage:** BERT is specifically designed for natural language understanding tasks, such as question answering, sentiment analysis, and text classification. It has been pre-trained on large amounts of text data and can be fine-tuned for specific tasks.
**Strengths:** BERT models excel in capturing contextual information and have achieved state-of-the-art results in a wide range of NLP tasks. They can understand the meaning of words in context and handle bidirectional dependencies well.
**Limitations:** BERT models are computationally expensive, and they require a large amount of pre-training data. Fine-tuning on specific tasks is necessary for optimal performance.

**My Conclusion**
LSTM is preferred over BERT for spam email detection here because it can capture sequential patterns in text. Spam emails often exhibit specific linguistic structures and patterns, and LSTMs are well suited to modeling such sequential dependencies. BERT, designed for deep contextual understanding, might be overkill for a simpler spam detection task such as this one, where recurring surface patterns matter more than fine-grained context.

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions Email Spam Classification/requirements.txt
@@ -0,0 +1,9 @@
nltk
spacy
keras
tensorflow
pandas
scikit-learn
matplotlib

# These are all the required libraries. Steps to install and import them are given in the notebook.