[docs] Spanish translation of tokenizer_summary.md (#31154)
* add tokenizer_summary to es/_toctree.yml

* add tokenizer_summary to es/

* fix link to Transformer XL in en/

* translate until Subword tokenization section

* fix GPT link in en/

* fix other GPT link in en/

* fix typo in en/

* translate the doc

* run make fixup

* Remove .md in Transformer XL link

* fix some link issues in es/

* fix typo
aaronjimv authored Jun 3, 2024
1 parent 8a1a23a commit c73ee13
Showing 3 changed files with 181 additions and 4 deletions.
8 changes: 4 additions & 4 deletions docs/source/en/tokenizer_summary.md
@@ -73,7 +73,7 @@ As can be seen space and punctuation tokenization, as well as rule-based tokeniz
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
-usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
+usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transfo-xl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
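To make that blow-up concrete, here is a minimal Python sketch (the toy corpus and the regex are invented for illustration; this is not the actual Transformer XL pipeline) that tokenizes on whitespace and punctuation and counts the resulting vocabulary:

```python
import re
from collections import Counter

# Toy corpus; the corpus behind Transformer XL's 267,735-token vocabulary
# contains millions of sentences.
corpus = [
    "Don't you love 🤗 Transformers? We sure do.",
    "Transformers are great, aren't they?",
]

def word_tokenize(text):
    # Keep runs of word characters as words and punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

vocab = Counter(token for sentence in corpus for token in word_tokenize(sentence))
print(len(vocab))            # every new surface form adds an entry
print(vocab.most_common(5))
```

On real data every inflection, typo, and rare proper noun becomes its own entry, which is how word-level vocabularies reach sizes like 267,735.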
@@ -142,7 +142,7 @@ on.
Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into
words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm),
-[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
+[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/openai-gpt) which uses
spaCy and ftfy, to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
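The merge procedure behind BPE can be sketched in a few lines of Python. The corpus and its frequencies below are invented for illustration, and plain space splitting stands in for the pre-tokenizers mentioned above:

```python
from collections import Counter

# Toy corpus; the word counts stand in for what a pre-tokenizer
# (space splitting, Moses, spaCy + ftfy, ...) would produce on real data.
corpus = "hug " * 10 + "pug " * 5 + "pun " * 12 + "bun " * 4 + "hugs " * 5
word_freqs = Counter(corpus.split())

# Start by representing every word as a sequence of single characters.
splits = {word: tuple(word) for word in word_freqs}

def most_frequent_pair(splits, word_freqs):
    pair_freqs = Counter()
    for word, symbols in splits.items():
        for pair in zip(symbols, symbols[1:]):
            pair_freqs[pair] += word_freqs[word]
    return pair_freqs.most_common(1)[0]

def merge_pair(pair, splits):
    new_splits = {}
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_splits[word] = tuple(merged)
    return new_splits

for step in range(3):  # real vocabularies use tens of thousands of merges
    pair, freq = most_frequent_pair(splits, word_freqs)
    splits = merge_pair(pair, splits)
    print(f"merge {step + 1}: {pair} (frequency {freq})")
print(splits)
```

With these counts the first merges are `('u', 'g')`, `('u', 'n')` and `('h', 'ug')`, after which `hugs` is tokenized as `["hug", "s"]`.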
@@ -195,7 +195,7 @@ the symbol `"m"` is not in the base vocabulary. In general, single letters such
to happen for very special characters like emojis.

As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
-to choose. For instance [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters
+to choose. For instance [GPT](model_doc/openai-gpt) has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop training after 40,000 merges.
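As a quick sanity check of that 478 + 40,000 figure, a short sketch (assuming `transformers` is installed and the `openai-gpt` checkpoint is reachable on the Hugging Face Hub) can print the tokenizer's vocabulary size:

```python
from transformers import AutoTokenizer

# The checkpoint name is an assumption for this example; it corresponds to the
# GPT model documented at the fixed link (model_doc/openai-gpt).
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")

# 478 base characters + 40,000 learned merges
print(tokenizer.vocab_size)  # expected: 40478
```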

#### Byte-level BPE
@@ -268,7 +268,7 @@ $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$

All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use language
-specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer).
+specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer.
To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input
as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
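As a rough illustration of the SentencePiece behaviour described above, the sketch below (assuming `transformers` and `sentencepiece` are installed and the `xlnet-base-cased` checkpoint is available) shows how a SentencePiece-based tokenizer keeps the space inside its tokens as the `▁` symbol:

```python
from transformers import AutoTokenizer

# XLNet's tokenizer is SentencePiece-based; the checkpoint name is an
# assumption for this example.
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print(tokens)  # pieces such as '▁Don', "'", 't', '▁you', ... keep the space as '▁'
print(tokenizer.convert_tokens_to_string(tokens))
```

Because the space survives as `▁`, detokenization is simply concatenating the pieces and replacing `▁` with a space.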
2 changes: 2 additions & 0 deletions docs/source/es/_toctree.yml
@@ -92,6 +92,8 @@
title: Lo que 🤗 Transformers puede hacer
- local: tasks_explained
title: Como los 🤗 Transformers resuelven tareas
+- local: tokenizer_summary
+  title: Descripción general de los tokenizadores
- local: attention
title: Mecanismos de atención
- local: pad_truncation
175 changes: 175 additions & 0 deletions docs/source/es/tokenizer_summary.md (diff not shown)
