generated from tulane-cmps6730/sample-project
Commit 7741544 (1 parent: 3936f35)
Showing 12 changed files with 134 additions and 89 deletions.
@@ -1,6 +1,8 @@
 ---
 layout: slide
-title: "NLP Project"
+title: "Using Natural Language Processing to Identify Unfair Clauses in Terms and Conditions Documents"
 ---

-Use the right arrow to begin!
+**Authors:** Jonathan Sears, Nick Radwin
+**Institution:** Tulane University
+**Emails:** [email protected], [email protected]
@@ -1,19 +1,9 @@
 ---
 layout: slide
-title: "Equations and Tables"
+title: "Introduction"
 ---

-Here is an inline equation: $\sum_{i=1}^n i = ?$
+## Introduction

-And a block one:
-
-$$e = mc^2$$
-
-
-Here is a table:
-
-| header 1 | header 2 |
-|----------|----------|
-| value 1  | value 2  |
-| value 3  | value 4  |
+
+Despite their ubiquity, terms and conditions are seldom read by users, leaving them unaware of potentially exploitative or unfair clauses. Our project aims to bring these hidden clauses to light with a sentence-level text classifier that labels each clause as either exploitative (1) or non-exploitative (0). We based these labels on the categories outlined in a prior paper, which we discuss shortly.
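For concreteness, here is a minimal sketch of the binary labelling scheme this slide describes, assuming each sentence arrives tagged with one of the nine unfairness subcategories or with no tag at all; the category names and helper function are illustrative placeholders, not the project's actual code.

```python
# Sketch: collapse per-sentence unfairness tags into a binary label.
# Any unfairness subcategory -> 1 (exploitative); no tag -> 0 (non-exploitative).
# Category names below are illustrative placeholders, not the exact tag set.
UNFAIR_CATEGORIES = {
    "arbitration", "unilateral_change", "content_removal",
    "jurisdiction", "choice_of_law", "limitation_of_liability",
    "unilateral_termination", "contract_by_using",
}

def to_binary_label(category):
    """Map a sentence's (possibly missing) unfairness tag to 0/1."""
    return int(category in UNFAIR_CATEGORIES)

examples = [
    ("We may terminate your account at any time without notice.", "unilateral_termination"),
    ("You can change your password from the settings page.", None),
]
labels = [to_binary_label(tag) for _, tag in examples]  # -> [1, 0]
```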
@@ -1,13 +1,9 @@
 ---
 layout: slide
-title: "Images"
+title: "Related Work"
 ---
+Our experiments are primarily based on **CLAUDETTE**, a research project conducted at Stanford in 2018.
+
+They ultimately used an ensemble method, combining SVMs with LSTMs and CNNs, to achieve accuracy and F1 scores above 0.8. This was our target for this project.

-Two ways to add an image.
-
-Note that the image is in the assets/img folder.
-
-<img src="{{ site.baseurl }}/assets/img/tulane.png" width="50%">
-
-![tulane](assets/img/tulane.png)
+![claudette](assets/img/claudette.png)
@@ -0,0 +1,12 @@
+---
+layout: slide
+title: "Approach"
+---
+
+We employed multiple machine learning approaches to address the challenge of identifying unfair clauses:
+- **BERT models:** Utilized for their deep contextual representations.
+- **Bag of Words (BoW):** Simplified text representation focusing on term frequencies.
+- **Support Vector Machine (SVM):** Tested for its capability to establish a clear decision boundary.
+- **Convolutional Neural Network (CNN):** Explored for its pattern recognition capabilities within text data.
+- **Gradient Boosting Machine (GBM):** Chosen for its robustness and iterative improvement on classification tasks.
+- **Hybrid BERT/BoW model:** An attempt to combine the strengths of BERT and BoW models.
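To make one of the simpler approaches in the list above concrete, here is a minimal sketch of a Bag of Words + linear SVM baseline using scikit-learn; the toy sentences, split, and hyperparameters are placeholders, not the project's actual configuration.

```python
# Sketch: BoW term-frequency features feeding a linear SVM (1 = unfair, 0 = fair).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data; in the project these come from the labeled terms-and-conditions corpus.
sentences = [
    "We may change these terms at any time without notice.",
    "Any dispute will be resolved by binding arbitration.",
    "You can export your data from the settings page.",
    "This page describes how to contact customer support.",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.5, random_state=0, stratify=labels
)

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigram + bigram term frequencies
    LinearSVC(C=1.0),                     # linear decision boundary over BoW features
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```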
This file was deleted.
@@ -0,0 +1,7 @@
+---
+layout: slide
+title: "Dataset and Metrics"
+---
+- **Dataset:** Consisted of 100 labeled terms and conditions documents, with each sentence categorized as either fair or as one of nine subcategories of unfair.
+- **Binary Classification:** Simplified from multiple classes to two (fair and unfair) to address the dataset's imbalance (92% of sentences are fair).
+- **Evaluation Metrics:** Precision, recall, and F1 score, with models trained on an evenly distributed sample for a fair performance evaluation.
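As a sketch of how the evenly distributed training sample and the evaluation metrics described above might be wired up, assuming the labeled sentences live in a pandas DataFrame with `sentence` and `label` columns (the column names and the downsampling strategy are assumptions, not the project's exact procedure):

```python
# Sketch: downsample the majority class so fair (0) and unfair (1) are evenly
# represented in training, then score predictions with precision/recall/F1.
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def balance(df, label_col="label", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=n, random_state=seed))
          .reset_index(drop=True)
    )

df = pd.DataFrame({
    "sentence": ["clause a", "clause b", "clause c", "clause d", "clause e"],
    "label":    [0, 0, 0, 0, 1],   # imbalanced toy data: mostly fair
})
train_df = balance(df)             # one fair + one unfair sentence

# y_true / y_pred would come from a held-out test split in the real pipeline.
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```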
@@ -0,0 +1,6 @@
+---
+layout: slide
+title: "Experiments"
+---
+We originally experimented with the more complex BERT representation of the text. The thinking was that BERT encodings would capture a better understanding of the text, both semantically and contextually. We experimented with many different methods of fine-tuning BERT, including fine-tuning a single classifier layer on top of the pooled output.
+However, we were unable to produce results near those of CLAUDETTE, with our best variants of the fine-tuned BERT model unable to crack an F1 score of 0.6.
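For reference, a minimal sketch of the kind of fine-tuning described above, a single classification head on top of BERT's pooled output, using Hugging Face `transformers`; the checkpoint, learning rate, and toy batch are illustrative assumptions, not the exact configuration we ran.

```python
# Sketch: fine-tune BERT with a 2-class classification head on the pooled output.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # linear classifier over the pooled [CLS] representation
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch; the real loop iterates over the balanced clause dataset for several epochs.
batch = tokenizer(
    ["We may terminate your account at any time.", "You may export your data at no cost."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
labels = torch.tensor([1, 0])           # 1 = unfair, 0 = fair

model.train()
outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```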
Empty file.
Empty file.