LexiGuard


Detecting toxicity in online comments using an LSTM recurrent neural network

Capstone group project for neue fische Data Science Bootcamp:
Presentation slides (PDF)
Presentation video


The most convenient way to run the notebooks in this repo is probably in the cloud, e.g. on Google Colab. To do so, open a notebook and click the Colab badge at the top.

If you would rather run the notebooks locally on your own machine, you may want to install the Anaconda distribution and create a virtual environment from the included environment.yml (conda env create -f environment.yml).

To run the Streamlit prototype dashboard, use this command: python -m streamlit run

Repo Contents

File                        Description
eda.ipynb                   Initial exploratory data analysis
data_preprocessing.ipynb    Create data file(s)
baseline_model.ipynb        Baseline model (BOW + logistic regression)
random_forest.ipynb         Random forest experiments
xgboost.ipynb               XGBoost experiments
lstm.ipynb                  LSTM final model (TensorFlow)
[file name missing]         Very basic prototype dashboard using Streamlit
functions.ipynb             Utility functions
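
For readers new to the baseline idea (bag-of-words features fed into logistic regression), here is a minimal, dependency-free sketch. It is not the code from baseline_model.ipynb: the toy comments, all function names, and the learning rate and epoch count are assumptions made for illustration, and a real experiment would more likely use a library implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token in the corpus to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Return the bag-of-words count vector of a text (unknown words are ignored)."""
    vector = [0.0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = float(count)
    return vector


def predict_proba(features, weights, bias):
    """Probability of the positive (toxic) class under the logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=200):
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy data, invented for this sketch (1 = toxic, 0 = not toxic).
comments = [
    "you are an idiot",
    "what a stupid idea",
    "thanks for the helpful answer",
    "have a nice day",
]
labels = [1, 1, 0, 0]

vocab = build_vocabulary(comments)
X = [vectorize(comment, vocab) for comment in comments]
weights, bias = train_logistic_regression(X, labels)

for text in ["you stupid idiot", "thanks for the nice answer"]:
    probability = predict_proba(vectorize(text, vocab), weights, bias)
    print(f"{text!r}: P(toxic) = {probability:.2f}")
```

Because the model only sees word counts, word order and context are lost, which is one reason a sequence model such as an LSTM is a natural next step after a baseline like this.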


This project was the group's first venture into NLP, so it was first and foremost about learning and experimenting. Many of these experiments did not make it into the final version of the project. Some examples of what we also tinkered with:

  • spaCy (for vectorization)
  • BERT / Hugging Face Transformers
  • fastText (for vectorization)
  • Gensim
  • Naive Bayes
  • random undersampling
  • POS tagging
  • stemming
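
Of the items above, random undersampling is the easiest to illustrate in a few lines. The following is a rough sketch of the general technique only, not the code we used: the function name random_undersample, the fixed seed, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=42):
    """Randomly drop samples until every class is as small as the rarest one."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is cut down to the size of the smallest class.
    target_size = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, target_size):
            balanced.append((sample, label))

    rng.shuffle(balanced)  # avoid returning the data grouped by class
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy data, invented for this sketch: 8 clean comments and 2 toxic ones.
comments = [f"clean comment {i}" for i in range(8)] + ["toxic comment 0", "toxic comment 1"]
labels = [0] * 8 + [1] * 2

balanced_comments, balanced_labels = random_undersample(comments, labels)
print(Counter(labels))           # class counts before balancing
print(Counter(balanced_labels))  # class counts after balancing
```

The trade-off is that balancing the classes this way throws away most of the majority-class examples, which can hurt when the data set is small.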


Code based on collaborative work by:
André Oliveira (Bambuzera)
Eric Martinez (ericmartinez1189)
Purvi Parmar (PurviDParmar)
Michael Schickenberg (CalleRosa40)





