🧑‍💻: Deep_Learning/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/model1 Enhancement problem #52

Closed
Sakeebhasan123456 opened this issue Oct 3, 2024 · 3 comments · Fixed by #90
Labels: Contributor, good first issue, gssoc-ext, hacktoberfest, Level Update, level1, Status: Assigned💻

Comments

@Sakeebhasan123456
Contributor

Hi @UTSAVS26, I analyzed the deep learning model and found the problems listed below. I would like to work on them, so please assign this issue to me.

SMS Spam Classification Project

Hidden Problems & Solutions

Identifying and addressing these problems is crucial for enhancing the performance, reliability, and usability of the SMS Spam Classification model.

| # | Problem | Description | Solution |
|---|---------|-------------|----------|
| 1 | Class Imbalance Not Addressed | The dataset contains approximately 87.4% ham and 12.6% spam messages without balancing techniques. | Apply SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic spam samples. Use class weighting in algorithms to give more importance to the spam class. |
| 2 | Limited Evaluation Metrics | Evaluation primarily relies on accuracy and precision. | Incorporate recall, F1-score, and ROC AUC score to get a more comprehensive evaluation of model performance. Use confusion matrix visualizations. |
| 3 | Single Train-Test Split Without Cross-Validation | Only a single train-test split is used for model evaluation. | Implement Stratified K-Fold Cross-Validation to ensure reliable and generalizable performance estimates. Utilize cross_val_score for robust metrics. |
| 4 | Suboptimal Model Selection and Lack of Hyperparameter Tuning | Multiple models are used without thorough hyperparameter optimization. | Perform GridSearchCV or RandomizedSearchCV for hyperparameter tuning. Explore advanced models like Logistic Regression, XGBoost, or LightGBM. |
| 5 | Use of GaussianNB with High-Dimensional Sparse Data | Gaussian Naive Bayes is applied to TF-IDF vectors, which are high-dimensional and sparse. | Prefer models like MultinomialNB or BernoulliNB for high-dimensional sparse data. Alternatively, use models that handle sparsity better, such as SVM or Random Forest. |
| 6 | Lack of Pipeline Integration | Preprocessing, feature extraction, and modeling steps are handled separately. | Utilize sklearn.pipeline to chain preprocessing and modeling steps, preventing data leakage and enhancing workflow maintainability. |
| 7 | Insufficient Text Preprocessing | Basic preprocessing steps are applied, but advanced techniques like handling contractions are missing. | Implement contraction expansion to maintain semantic integrity. Apply spelling correction and Named Entity Recognition (NER) to enhance text quality. |
| 8 | Potential Overfitting with MultinomialNB and BernoulliNB | These models can overfit if the feature space is too large or not properly regularized. | Apply regularization techniques such as adjusting the alpha parameter. Reduce feature dimensionality using TruncatedSVD or SelectKBest. |
| 9 | Streamlit App Dependencies Not Managed Properly | The Streamlit app comments out loading the vectorizer and model, indicating potential deployment issues. | Ensure proper saving and loading of the TF-IDF vectorizer and trained model using pickle or joblib. Verify file paths and manage dependencies correctly. |
| 10 | No Handling of Rare Words or Enhanced Feature Selection | TF-IDF is used without additional feature selection or handling of rare/high-frequency terms. | Enhance TF-IDF parameters (max_df, min_df, ngram_range) to better capture important features. Implement feature selection methods like Chi-Squared or Mutual Information. |
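
To illustrate why problems 1 and 2 matter together, here is a minimal pure-Python sketch. It assumes a hypothetical set of 1000 messages split at the reported ratio (874 ham, 126 spam) and a degenerate classifier that labels everything as ham. The helper names `confusion_counts` and `precision_recall_f1` are made up for this example and are not part of the project.

```python
# Illustrative only: toy labels mirroring the ~87.4% ham / 12.6% spam ratio.
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 1000 hypothetical messages: 874 ham, 126 spam.
y_true = ["ham"] * 874 + ["spam"] * 126
# A degenerate "classifier" that predicts ham for every message.
y_pred = ["ham"] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.874 recall=0.000 f1=0.000
```

The always-ham baseline reaches 87.4% accuracy while catching no spam at all, which is why recall and F1 on the spam class are worth reporting alongside accuracy.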

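Problems 3, 4, 5, 6, 8 and 10 could be tackled together along these lines. This is only a sketch, assuming scikit-learn is installed. The tiny corpus is invented for illustration, and the grid values are placeholders rather than tuned recommendations. Because the TF-IDF step lives inside the pipeline, it is refit on each training fold, so vocabulary statistics from the validation fold do not leak into training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; the real project would load its own dataset instead.
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "urgent winner call now",
    "cheap loans free approval", "free entry win cash prize",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the notes",
    "running late see you soon", "happy birthday have fun",
]
labels = [1] * 6 + [0] * 6  # 1 = spam, 0 = ham

# Chain feature extraction and the classifier so they are always fit together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", MultinomialNB()),  # suited to sparse, non-negative TF-IDF features
])

# Placeholder grid: n-gram range, rare-word cutoff, and NB smoothing strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__alpha": [0.1, 0.5, 1.0],
}

# Stratified folds keep the spam/ham ratio similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on F1 instead of accuracy keeps the search focused on the spam class. On the real dataset, `n_splits=5` and a wider grid would probably be more appropriate.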

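For problem 9, one possible approach is to persist the fitted vectorizer and model as a single pipeline object, so the Streamlit app only has one file to load. The sketch below assumes scikit-learn is installed. The file name `spam_pipeline.pkl`, the temporary directory, and the toy training data are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only.
texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "are we meeting for lunch",
    "see you at home tonight", "can you call me later",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Fit the vectorizer and classifier together as one object.
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(texts, labels)

# Hypothetical file name; a temp directory keeps this sketch self-contained.
model_path = Path(tempfile.mkdtemp()) / "spam_pipeline.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)

# The app side: load the saved pipeline and predict on raw text.
with model_path.open("rb") as f:
    loaded = pickle.load(f)

sample = ["free prize, call now"]
print(loaded.predict(sample))
```

Saving the whole pipeline avoids the vectorizer and model drifting out of sync. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to create them.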
To be mentioned while taking the issue:

  • Full name: Sakeeb hasan
  • Open source program name: GSSoC Extended

Happy Contributing 🚀

All the best. Enjoy your open source journey ahead. 😎


github-actions bot commented Oct 3, 2024

🙌 Thank you for bringing this issue to our attention! We appreciate your input and will investigate it as soon as possible.

Feel free to join our community on Discord to discuss more!

@UTSAVS26 added the labels good first issue, Contributor, Status: Assigned💻, level2, gssoc-ext, and hacktoberfest on Oct 3, 2024
@Sakeebhasan123456
Contributor Author

Hi @UTSAVS26, please take a look: I have sent the pull request.


github-actions bot commented Oct 4, 2024

✅ This issue has been closed. Thank you for your contribution! If you have any further questions or issues, feel free to join our community on Discord to discuss more!
