Add Antonym Substitution data to Negation Dataset
dmlls committed Apr 29, 2023
1 parent f4817b1 commit 50563fd
Showing 3 changed files with 77,405 additions and 11 deletions.
39 changes: 28 additions & 11 deletions negation-dataset/README.md
@@ -1,27 +1,44 @@
### Negation Dataset

This dataset currently contains **60592 samples**, of which half of them are
negated pairs of sentences, and the other half are not (they are paraphrased
versions of each other).
Version 1.1 of the dataset contains **77376 samples**, of which roughly half
are negated pairs of sentences, and the other half are not (they are
paraphrased versions of each other).

<br>

The dataset has been created by cleaning up and merging the following datasets:

* _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal
1. _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal
Negation_ (see
[`nan-nli`](https://github.com/dmlls/negation-datasets/tree/main/nan-nli))
[`nan-nli`](https://github.com/dmlls/negation-datasets/tree/main/nan-nli)).

2. _GLUE Diagnostic Dataset_ (see
[`glue-diagnostic`](https://github.com/dmlls/negation-datasets/tree/main/glue-diagnostic)).

3. _Automated Fact-Checking of Claims from Wikipedia_ (see
[`wikifactcheck-english`](https://github.com/dmlls/negation-datasets/tree/main/wikifactcheck-english)).

* _GLUE Diagnostic Dataset_ (see
[`glue-diagnostic`](https://github.com/dmlls/negation-datasets/tree/main/glue-diagnostic))
4. _From Group to Individual Labels Using Deep Features_ (see
[`sentiment-labelled-sentences`](https://github.com/dmlls/negation-datasets/tree/main/sentiment-labelled-sentences)).
In this case, the negated sentences were obtained by using the Python module
[`negate`](https://github.com/dmlls/negate).

* _Automated Fact-Checking of Claims from Wikipedia_ (see
[`glue-diagnostic`](https://github.com/dmlls/negation-datasets/tree/main/wikifactcheck-english))

Additionally, for each of the negated samples, another pair of non-negated
sentences has been added by paraphrasing them with the pre-trained model
[`🤗tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase).

Finally, the dataset from _It Is Not Easy To Detect Paraphrases: Analysing
Semantic Similarity With Antonyms and Negation Using the New SemAntoNeg
Benchmark_ (see
[`antonym-substitution`](https://github.com/dmlls/negation-datasets/tree/main/antonym-substitution))
has also been included. This dataset already provides both the paraphrased and
negated version for each premise, so no further processing was needed.

<br>

The resulting file is a
[`.tsv`](https://github.com/dmlls/negation-datasets/blob/main/negation-dataset/negation_dataset.tsv)
[`.tsv`](https://github.com/dmlls/negation-datasets/blob/main/negation-dataset/negation_dataset_v1.1.tsv)
with the following format:

| premise | hypothesis | label |
Expand Down
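As an illustration of this three-column layout, the `.tsv` file can be read
with Python's standard `csv` module. The rows and label values below are
hypothetical, made up only to mirror the premise/hypothesis/label format
described above; the dataset's actual label scheme may differ:

```python
import csv
import io

# Hypothetical sample mimicking the premise/hypothesis/label columns of the
# dataset's .tsv file (label values here are illustrative assumptions).
sample_tsv = (
    "premise\thypothesis\tlabel\n"
    "The cat is on the mat.\tThe cat is not on the mat.\t1\n"
    "The cat is on the mat.\tA cat sits on the mat.\t0\n"
)

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
for row in rows:
    print(row["premise"], "->", row["hypothesis"], "| label:", row["label"])
```

Reading the released file itself would only require swapping the
`io.StringIO` wrapper for `open("negation_dataset_v1.1.tsv", newline="")`.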
File renamed without changes.
