slides
JonathanSears1 committed Apr 30, 2024
1 parent 3936f35 commit 7741544
Showing 12 changed files with 134 additions and 89 deletions.
6 changes: 3 additions & 3 deletions app/app.py
@@ -119,13 +119,13 @@ def predict_bow():
preds = []
data = request.get_json(force=True)
texts = data['text']
print(texts)
# print(texts)
preprocessed_text = [preprocess(text, n=2) for text in texts.split('.')]
texts_joined = [' '.join(text) for text in preprocessed_text]
print(texts_joined)
# print(texts_joined)
vectorized_text = vectorizer.transform(texts_joined)
preds = bow_model.predict(vectorized_text)
print(preds)
# print(preds)
return jsonify(prediction=preds.tolist(),text=texts.split('.'))
return None
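
For context, the preprocessing flow inside `predict_bow` can be sketched in isolation. The `preprocess` function below is a simplified stand-in (the app's real one, called with `n=2`, is defined elsewhere in app.py), and the vectorizer is fitted on the spot rather than loaded pre-fitted as in the app:

```python
# Minimal sketch of the predict_bow preprocessing flow.
# preprocess() here is a simplified stand-in for the app's real function,
# and the vectorizer is fitted inline rather than loaded pre-fitted.
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(text, n=2):
    # stand-in: lowercase and whitespace-tokenize; the real function is elsewhere
    return text.lower().split()

texts = "You agree to arbitration. We may change terms at any time."
preprocessed_text = [preprocess(text, n=2) for text in texts.split('.')]
texts_joined = [' '.join(text) for text in preprocessed_text]

vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(texts_joined)
print(vectorized_text.shape)  # one row per chunk from the '.' split
```

Note that splitting on `'.'` produces a trailing empty chunk, which still becomes an all-zero row in the matrix.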

6 changes: 4 additions & 2 deletions docs/_posts/0000-01-01-intro.md
@@ -1,6 +1,8 @@
---
layout: slide
title: "NLP Project"
title: "Using Natural Language Processing to Identify Unfair Clauses in Terms and Conditions Documents"
---

Use the right arrow to begin!
**Authors:** Jonathan Sears, Nick Radwin
**Institution:** Tulane University
**Emails:** [email protected], [email protected]
16 changes: 3 additions & 13 deletions docs/_posts/0000-01-02-overview.md
@@ -1,19 +1,9 @@
---
layout: slide
title: "Equations and Tables"
title: "Introduction"
---


Here is an inline equation: $\sum_{i=1}^n i = ?$
## Introduction

And a block one:

$$e = mc^2$$


Here is a table:

| header 1 | header 2 |
|----------|----------|
| value 1 | value 2 |
| value 3 | value 4 |
Despite their ubiquity, terms and conditions are seldom read by users, leading to widespread ignorance about potentially exploitative or unfair clauses. Our project aims to bring these hidden clauses to light using a sentence-level text classifier that labels clauses as either exploitative (1) or non-exploitative (0). We based these labels on the categories outlined in a prior paper we will discuss shortly.
12 changes: 4 additions & 8 deletions docs/_posts/0000-01-03-next.md
@@ -1,13 +1,9 @@
---
layout: slide
title: "Images"
title: "Related Work"
---
Our experiments are primarily based on **CLAUDETTE**, a research project conducted at Stanford in 2018.

They ultimately used an ensemble method, combining SVMs with LSTMs and CNNs, to achieve accuracy and F1 scores above 0.8. This was our target for this project.

Two ways to add an image.

Note that the image is in the assets/img folder.

<img src="{{ site.baseurl }}/assets/img/tulane.png" width="50%">

![tulane](assets/img/tulane.png)
![claudette](assets/img/claudette.png)
12 changes: 12 additions & 0 deletions docs/_posts/0000-01-04-approach.md
@@ -0,0 +1,12 @@
---
layout: slide
title: "Approach"
---

We employed multiple machine learning approaches to address the challenge of identifying unfair clauses:
- **BERT models:** Utilized for their deep contextual representations.
- **Bag of Words (BoW):** Simplified text representation focusing on term frequencies.
- **Support Vector Machine (SVM):** Tested for its capability to establish a clear decision boundary.
- **Convolutional Neural Network (CNN):** Explored for its pattern recognition capabilities within text data.
- **Gradient Boosting Machine (GBM):** Chosen for its robustness and iterative improvement on classification tasks.
- **Hybrid BERT/BoW model:** An attempt to combine the strengths of BERT and BoW models.
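
As an illustration of the simpler end of this list, the BoW + SVM combination can be sketched as below. The clause texts and labels are invented toy examples, not the project's data:

```python
# Hedged sketch: bag-of-words features feeding a linear SVM.
# Clauses and labels are invented toy examples, not the real dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

clauses = [
    "we may terminate your account at any time without notice",  # unfair (1)
    "you retain ownership of the content you upload",            # fair (0)
    "we can change these terms at our sole discretion",          # unfair (1)
    "you may cancel your subscription at any time",              # fair (0)
]
labels = [1, 0, 1, 0]

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vec.fit_transform(clauses)
clf = LinearSVC().fit(X, labels)

pred = clf.predict(vec.transform(["we may suspend service at our sole discretion"]))
print(pred)
```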
6 changes: 0 additions & 6 deletions docs/_posts/0000-01-04-conclusion.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/_posts/0000-01-05-dataset-and-methodology.md
@@ -0,0 +1,7 @@
---
layout: slide
title: "Dataset and Metrics"
---
- **Dataset:** Consisted of 100 labeled terms and conditions documents, with each sentence categorized as either fair or as one of nine subcategories of unfair.
- **Binary Classification:** Simplified from multiple classes to two (fair and unfair) to address the dataset's imbalance (roughly 92% of sentences labeled fair).
- **Evaluation Metrics:** Precision, recall, and F1 score, with models trained on an evenly distributed sample for a fair performance evaluation.
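
The three metrics can be computed with scikit-learn; the labels below are invented to show the calculation, not project results:

```python
# Precision, recall, and F1 on invented binary labels (1 = unfair, 0 = fair).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)         # harmonic mean of the two
print(p, r, f1)  # 0.75 0.75 0.75
```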
6 changes: 6 additions & 0 deletions docs/_posts/0000-01-06-Experiments.md
@@ -0,0 +1,6 @@
---
layout: slide
title: "Experiments"
---
We originally experimented with the more complex BERT representation of the text. The thinking was that BERT encodings would capture a better understanding of the text, both semantically and contextually. We experimented with many different methods of fine-tuning BERT, including fine-tuning a single classifier layer on top of the pooled output.
However, we were unable to produce results near those of CLAUDETTE, with our best variants of the fine-tuned BERT model unable to exceed an F1 score of 0.6.
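
The "single classifier layer on top of the pooled output" setup can be sketched with PyTorch. To keep the example self-contained, a random tensor stands in for BERT's pooled [CLS] output (in practice it would come from an encoder such as transformers' `BertModel`):

```python
# Sketch of a single linear classifier head on BERT's pooled output.
# The pooled tensor is random here, standing in for a real BERT encoder's
# pooler_output, so the example runs without downloading a model.
import torch
import torch.nn as nn

hidden_size = 768                       # BERT-base pooled output dimension
pooled = torch.randn(4, hidden_size)    # stand-in for bert(...).pooler_output
labels = torch.tensor([1, 0, 1, 0])     # 1 = unfair clause, 0 = fair

classifier = nn.Linear(hidden_size, 2)  # the single fine-tuned layer
logits = classifier(pooled)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                         # gradients reach only the head here
print(logits.shape, loss.item())
```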
Empty file.
Empty file.
Binary file added docs/assets/img/claudette.png
