From a9e78659228c4ea68788bbb5ae07c040da8a4ba3 Mon Sep 17 00:00:00 2001
From: "Michael A. Hedderich"
Date: Tue, 27 Jul 2021 13:17:31 +0200
Subject: [PATCH 1/2] Extended Data Augmentation -> Text

- Added a structure into token level, sentence part level and sentence level augmentation
- Added some more references
- Replaced arxiv with ACL-Anthology links
---
 augmentation.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/augmentation.md b/augmentation.md
index 12345e8..f1fd543 100644
--- a/augmentation.md
+++ b/augmentation.md
@@ -57,14 +57,16 @@ While these primitives have culminated in compelling performance gains, they can
 
 Heuristic transformations for text typically involve paraphrasing text in order to produce more diverse samples.
 
-- [Backtranslation](https://arxiv.org/abs/1511.06709) uses a round-trip translation from a source to target language and back in order to generate a paraphrase.
-  Examples of use include [QANet](https://arxiv.org/abs/1804.09541).
-- Synonym substitution methods replace words with their synonyms such as in
-  [Data Augmentation for Low-Resource Neural Machine Translation](https://www.aclweb.org/anthology/P17-2090/),
-  [Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations](https://www.aclweb.org/anthology/N18-2072/),
+- On a token level, synonym substitution methods replace words with their synonyms. Synonyms might be chosen based on
+  - a knowledge base such as a thesaurus: e.g. [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) and [An Analysis of Simple Data Augmentation for Named Entity Recognition](https://aclanthology.org/2020.coling-main.343/)
+  - neighbors in a word embedding space: e.g. [That’s So Annoying!!!](https://www.aclweb.org/anthology/D15-1306/)
+  - probable words according to a language model that takes the sentence context into account: e.g. [Model-Portability Experiments for Textual Temporal Analysis](https://www.aclweb.org/anthology/P11-2047/),
-  [That’s So Annoying!!!](https://www.aclweb.org/anthology/D15-1306/) and
-  [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf)
+  [Data Augmentation for Low-Resource Neural Machine Translation](https://www.aclweb.org/anthology/P17-2090/) and
+  [Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations](https://www.aclweb.org/anthology/N18-2072/)
+- Sentence parts can be reordered by manipulating the syntax tree of a sentence: e.g. [Data augmentation via dependency tree morphing for low-resource languages](https://aclanthology.org/D18-1545/)
+- The whole sentence can be modified via [Backtranslation](https://aclanthology.org/P16-1009/). Here, a round-trip translation from a source language to a target language and back is used to generate a paraphrase. Examples of use include [QANet](https://arxiv.org/abs/1804.09541) and [Unsupervised Data Augmentation for Consistency Training](https://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html).
+
 [comment]: <> (- Noising)
 [comment]: <> (- Grammar induction)

From 086cd2bc27e5dc7e22b82c3758571c29f69d1ac6 Mon Sep 17 00:00:00 2001
From: Karan Goel
Date: Tue, 27 Jul 2021 17:37:32 -0400
Subject: [PATCH 2/2] Update THANKS.md

---
 THANKS.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/THANKS.md b/THANKS.md
index ae2c04d..4aa067d 100644
--- a/THANKS.md
+++ b/THANKS.md
@@ -14,5 +14,6 @@ The following individuals and organizations have contributed to the development
 - [Ce Zhang](https://scholar.google.ch/citations?user=GkXqbmMAAAAJ&hl=en) and [Cedric Renggli](https://people.inf.ethz.ch/rengglic/) from ETH-Zurich added discussion for data cleaning and MLOps
 - [Eugene Wu](http://www.cs.columbia.edu/~ewu/) from Columbia added discussion for data cleaning
 - [Cody Coleman](http://www.codycoleman.com) from Stanford added discussion for data selection
+- [Michael Hedderich](https://michael-hedderich.de) from Saarland Informatics added discussion for data augmentation
 
-Thanks to everyone who has provided feedback on this resource, including Dan Hendrycks and Jacob Steinhardt at UC-Berkeley, James Zou, Matei Zaharia, Daniel Kang, Chelsea Finn from Stanford, Mike Cafarella from MIT, Ameet Talkwalkar from CMU.
\ No newline at end of file
+Thanks to everyone who has provided feedback on this resource, including Dan Hendrycks and Jacob Steinhardt at UC-Berkeley, James Zou, Matei Zaharia, Daniel Kang, Chelsea Finn from Stanford, Mike Cafarella from MIT, Ameet Talkwalkar from CMU.
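
To make the token-level, thesaurus-based synonym substitution added in the first patch concrete, here is a minimal sketch that is not taken from any of the cited papers. It assumes NLTK is installed and the WordNet corpus has been downloaded via `nltk.download("wordnet")`; the helper name `synonym_substitute` and the replacement probability `p` are illustrative choices.

```python
# Minimal sketch of thesaurus-based (token-level) synonym substitution.
# Assumption: NLTK is installed and the WordNet corpus has been fetched with
# nltk.download("wordnet"); function and parameter names are illustrative,
# not taken from the cited papers.
import random

from nltk.corpus import wordnet


def synonym_substitute(tokens, p=0.1, seed=None):
    """Replace each token with a random WordNet synonym with probability p."""
    rng = random.Random(seed)
    augmented = []
    for token in tokens:
        if rng.random() < p:
            # Gather candidate lemmas from every WordNet synset of the token.
            candidates = {
                lemma.name().replace("_", " ")
                for synset in wordnet.synsets(token)
                for lemma in synset.lemmas()
                if lemma.name().lower() != token.lower()
            }
            if candidates:
                token = rng.choice(sorted(candidates))
        augmented.append(token)
    return augmented


# Example: produce one augmented variant of a tokenized sentence.
print(synonym_substitute("the quick brown fox jumps over the lazy dog".split(), p=0.3, seed=0))
```

The other variants listed in the patch differ mainly in where the candidate replacements come from: embedding-based methods use nearest neighbors in a word embedding space, language-model-based methods use probable words given the sentence context, and backtranslation paraphrases the whole sentence by round-tripping it through a translation system.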