Merge pull request #38 from michael-aloys/patch-2
Extended Data Augmentation -> Text
krandiash authored Jul 27, 2021
2 parents 7d9ca11 + 086cd2b commit 82dbc00
Showing 2 changed files with 11 additions and 8 deletions.
3 changes: 2 additions & 1 deletion THANKS.md
@@ -14,5 +14,6 @@ The following individuals and organizations have contributed to the development
- [Ce Zhang](https://scholar.google.ch/citations?user=GkXqbmMAAAAJ&hl=en) and [Cedric Renggli](https://people.inf.ethz.ch/rengglic/) from ETH-Zurich added discussion for data cleaning and MLOps
- [Eugene Wu](http://www.cs.columbia.edu/~ewu/) from Columbia added discussion for data cleaning
- [Cody Coleman](http://www.codycoleman.com) from Stanford added discussion for data selection
+- [Michael Hedderich](https://michael-hedderich.de) from Saarland Informatics added discussion for data augmentation

-Thanks to everyone who has provided feedback on this resource, including Dan Hendrycks and Jacob Steinhardt at UC-Berkeley, James Zou, Matei Zaharia, Daniel Kang, Chelsea Finn from Stanford, Mike Cafarella from MIT, Ameet Talkwalkar from CMU.
+Thanks to everyone who has provided feedback on this resource, including Dan Hendrycks and Jacob Steinhardt at UC-Berkeley, James Zou, Matei Zaharia, Daniel Kang, Chelsea Finn from Stanford, Mike Cafarella from MIT, Ameet Talkwalkar from CMU.
16 changes: 9 additions & 7 deletions augmentation.md
@@ -57,14 +57,16 @@ While these primitives have culminated in compelling performance gains, they can

Heuristic transformations for text typically involve paraphrasing text in order to produce more diverse samples.

-- [Backtranslation](https://arxiv.org/abs/1511.06709) uses a round-trip translation from a source to target language and back in order to generate a paraphrase.
-Examples of use include [QANet](https://arxiv.org/abs/1804.09541).
-- Synonym substitution methods replace words with their synonyms such as in
-[Data Augmentation for Low-Resource Neural Machine Translation](https://www.aclweb.org/anthology/P17-2090/),
-[Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations](https://www.aclweb.org/anthology/N18-2072/),
+- On a token level, synonym substitution methods replace words with their synonyms. Synonyms might be chosen based on
+  - a knowledge base such as a thesaurus: e.g. [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) and [An Analysis of Simple Data Augmentation for Named Entity Recognition](https://aclanthology.org/2020.coling-main.343/)
+  - neighbors in a word embedding space: e.g. [That’s So Annoying!!!](https://www.aclweb.org/anthology/D15-1306/)
+  - probable words according to a language model that takes the sentence context into account: e.g.
+[Model-Portability Experiments for Textual Temporal Analysis](https://www.aclweb.org/anthology/P11-2047/),
-[That’s So Annoying!!!](https://www.aclweb.org/anthology/D15-1306/) and
-[Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf)
+[Data Augmentation for Low-Resource Neural Machine Translation](https://www.aclweb.org/anthology/P17-2090/) and
+[Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations](https://www.aclweb.org/anthology/N18-2072/)
+- Sentence parts can be reordered by manipulating the syntax tree of a sentence: e.g. [Data augmentation via dependency tree morphing for low-resource languages](https://aclanthology.org/D18-1545/)
+- The whole sentence can be modified via [Backtranslation](https://aclanthology.org/P16-1009/). Here, a round-trip translation from a source language to a target language and back is used to generate a paraphrase. Examples of use include [QANet](https://arxiv.org/abs/1804.09541) and [Unsupervised Data Augmentation for Consistency Training](https://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html).


[comment]: <> (- Noising)
[comment]: <> (- Grammar induction)
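
To make the token-level synonym substitution described in the added bullets concrete, here is a minimal sketch of the thesaurus-based variant, assuming NLTK's WordNet corpus is installed; the `synonym_substitute` helper and its parameters are illustrative and not part of this repository or the cited papers.

```python
# Minimal sketch: thesaurus-based synonym substitution with WordNet (NLTK).
# Assumes: pip install nltk  and  nltk.download("wordnet") have been run.
import random

from nltk.corpus import wordnet


def synonym_substitute(sentence: str, p: float = 0.2, seed: int = 0) -> str:
    """Randomly replace a fraction of tokens with a WordNet synonym."""
    rng = random.Random(seed)
    tokens = sentence.split()  # naive whitespace tokenization, for illustration only
    augmented = []
    for token in tokens:
        # Collect candidate synonyms from all WordNet synsets of the token.
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(token.lower())
            for lemma in synset.lemmas()
            if lemma.name().lower() != token.lower()
        }
        if synonyms and rng.random() < p:
            augmented.append(rng.choice(sorted(synonyms)))
        else:
            augmented.append(token)
    return " ".join(augmented)


print(synonym_substitute("The quick brown fox jumps over the lazy dog"))
```

The same replacement loop could instead draw candidates from word-embedding neighbors or from a language model conditioned on the sentence context, as the other two sub-bullets describe.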
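Similarly, a minimal backtranslation sketch, assuming the Hugging Face `transformers` MarianMT checkpoints (`Helsinki-NLP/opus-mt-en-fr` and `Helsinki-NLP/opus-mt-fr-en`); the pivot language and the `backtranslate` helper are illustrative choices, not taken from QANet or the other cited work.

```python
# Minimal sketch: backtranslation (English -> French -> English) with MarianMT.
# Assumes: pip install transformers torch sentencepiece
from transformers import MarianMTModel, MarianTokenizer


def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def backtranslate(texts):
    """Round-trip translation to generate paraphrases of the input sentences."""
    pivot = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(pivot, "Helsinki-NLP/opus-mt-fr-en")


print(backtranslate(["The quick brown fox jumps over the lazy dog."]))
```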
