# CANNOT dataset

Compilation of ANnotated, Negation-Oriented Text-pairs
## Introduction

**CANNOT** is a dataset that focuses on negated textual pairs. It currently
contains **77,376 samples**, of which roughly half are negated pairs of
sentences, while the other half are paraphrased versions of each other.

The most frequent negation in the dataset is verbal negation (e.g.,
will → won't), although it also contains pairs with antonyms (e.g., cold → hot).
## Format

The dataset is given as a
[`.tsv`](https://en.wikipedia.org/wiki/Tab-separated_values) file with the
following structure:

| premise     | hypothesis                                         | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). |   0   |
| A sentence. | The sentence negated.                              |   1   |
The dataset can be easily loaded into a Pandas DataFrame by running:

```python
import pandas as pd

dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
```
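Once loaded, the two classes can be separated through the `label` column. A minimal sketch, using a small in-memory stand-in for the dataset (the example rows are hypothetical, but the column names match the table above):

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's structure.
dataset = pd.DataFrame({
    "premise": ["It will rain.", "It will rain."],
    "hypothesis": ["Rain is expected.", "It won't rain."],
    "label": [0, 1],
})

paraphrased = dataset[dataset["label"] == 0]  # non-negated pairs
negated = dataset[dataset["label"] == 1]      # negated pairs

print(len(paraphrased), len(negated))
```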
## Construction

The dataset has been created by cleaning up and merging the following datasets:

1. _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal
   Negation_ (see
   [`datasets/nan-nli`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/nan-nli)).

2. _GLUE Diagnostic Dataset_ (see
   [`datasets/glue-diagnostic`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/glue-diagnostic)).

3. _Automated Fact-Checking of Claims from Wikipedia_ (see
   [`datasets/wikifactcheck-english`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/wikifactcheck-english)).

4. _From Group to Individual Labels Using Deep Features_ (see
   [`datasets/sentiment-labelled-sentences`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/sentiment-labelled-sentences)).
   In this case, the negated sentences were obtained by using the Python module
   [`negate`](https://github.com/dmlls/negate).

5. _It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With
   Antonyms and Negation Using the New SemAntoNeg Benchmark_ (see
   [`datasets/antonym-substitution`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/antonym-substitution)).
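Once each source has been cleaned into the shared premise/hypothesis/label format, the merge itself reduces to a concatenation. A sketch with in-memory stand-ins for two of the cleaned sources (the example rows are hypothetical; the per-dataset clean-up is source-specific and not shown):

```python
import pandas as pd

# In-memory stand-ins for two cleaned sources (hypothetical data).
nan_nli = pd.DataFrame(
    {"premise": ["He left."], "hypothesis": ["He did not leave."], "label": [1]}
)
glue_diag = pd.DataFrame(
    {"premise": ["She sings."], "hypothesis": ["She performs songs."], "label": [0]}
)

# Merging amounts to concatenating the sources and re-indexing.
merged = pd.concat([nan_nli, glue_diag], ignore_index=True)
print(len(merged))
```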
Additionally, for each of the negated samples, another pair of non-negated
sentences has been added by paraphrasing them with the pre-trained model
[`🤗tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase).

Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been
included, and any duplicates have been removed.

The contribution of each of these individual datasets to the final CANNOT
dataset is:

| Dataset                                             |    Samples |
|:----------------------------------------------------|-----------:|
| Not another Negation Benchmark                      |        118 |
| GLUE Diagnostic Dataset                             |        154 |
| Automated Fact-Checking of Claims from Wikipedia    |     14,970 |
| From Group to Individual Labels Using Deep Features |      2,110 |
| It Is Not Easy To Detect Paraphrases                |      8,597 |
| **Total**                                           | **25,949** |

_Note_: The numbers above include only the original queries present in the
datasets.
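The swap-and-deduplicate step can be sketched as follows; this is a minimal illustration (with hypothetical rows) of the idea, not the project's actual pipeline. The label is symmetric under swapping, so it carries over unchanged:

```python
import pandas as pd

dataset = pd.DataFrame({
    "premise": ["It is cold.", "It is cold."],
    "hypothesis": ["It is freezing.", "It is hot."],
    "label": [0, 1],
})

# Swap premise and hypothesis by renaming both columns at once,
# then restore the original column order.
swapped = dataset.rename(columns={"premise": "hypothesis", "hypothesis": "premise"})
swapped = swapped[["premise", "hypothesis", "label"]]

# Append the swapped pairs and drop exact duplicates.
augmented = pd.concat([dataset, swapped], ignore_index=True).drop_duplicates()
print(len(augmented))  # 4: no pair here equals its own swap
```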
## Contributions

Questions? Bugs? Feel free to [open a new
issue](https://github.com/dmlls/cannot-dataset/issues/new/).
## Acknowledgments

We thank all the previous authors who have made this dataset possible:

Thinh Hung Truong, Yulia Otmakhova, Timothy Baldwin, Trevor Cohn, Jey Han Lau,
Karin Verspoor, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer
Levy, Samuel R. Bowman, Aalok Sathe, Salar Ather, Tuan Manh Le, Nathan Perry,
Joonsuk Park, Dimitrios Kotzias, Misha Denil, Nando De Freitas, Padhraic Smyth,
Teemu Vahtola, Mathias Creutz, and Jörg Tiedemann.
## License

The CANNOT dataset is released under [CC BY-SA
4.0](https://creativecommons.org/licenses/by-sa/4.0/).
## Citation

TBA.