From cd2f30dc08965da3e5ae5850e3d9f4d6b0fa613b Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 10:29:56 -0800 Subject: [PATCH 01/12] add tokenizer_summary to es/_toctree.yml --- docs/source/es/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/es/_toctree.yml b/docs/source/es/_toctree.yml index 5a20aca2e56a35..45dd27abaf100a 100644 --- a/docs/source/es/_toctree.yml +++ b/docs/source/es/_toctree.yml @@ -92,6 +92,8 @@ title: Lo que 🤗 Transformers puede hacer - local: tasks_explained title: Como los 🤗 Transformers resuelven tareas + - local: tokenizer_summary + title: Descripción general de los tokenizadores - local: attention title: Mecanismos de atención - local: pad_truncation From 2d1c1cf2922e01bdaf541fa66576d68489f447d3 Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 10:31:52 -0800 Subject: [PATCH 02/12] add tokenizer_summary to es/ --- docs/source/es/tokenizer_summary.md | 282 ++++++++++++++++++++++++++++ 1 file changed, 282 insertions(+) create mode 100644 docs/source/es/tokenizer_summary.md diff --git a/docs/source/es/tokenizer_summary.md b/docs/source/es/tokenizer_summary.md new file mode 100644 index 00000000000000..fbe8f6f7a17743 --- /dev/null +++ b/docs/source/es/tokenizer_summary.md @@ -0,0 +1,282 @@ + + +# Summary of the tokenizers + +[[open-in-colab]] + +On this page, we will have a closer look at tokenization. + + + +As we saw in [the preprocessing tutorial](preprocessing), tokenizing a text is splitting it into words or +subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is +straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text). +More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding +(BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples +of which tokenizer type is used by which model. + +Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer +type was used by the pretrained model. For instance, if we look at [`BertTokenizer`], we can see +that the model uses [WordPiece](#wordpiece). + +## Introduction + +Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so. +For instance, let's look at the sentence `"Don't you love 🤗 Transformers? We sure do."` + + + +A simple way of tokenizing this text is to split it by spaces, which would give: + +``` +["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] +``` + +This is a sensible first step, but if we look at the tokens `"Transformers?"` and `"do."`, we notice that the +punctuation is attached to the words `"Transformer"` and `"do"`, which is suboptimal. We should take the +punctuation into account so that a model does not have to learn a different representation of a word and every possible +punctuation symbol that could follow it, which would explode the number of representations the model has to learn. +Taking punctuation into account, tokenizing our exemplary text would give: + +``` +["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +Better. However, it is disadvantageous, how the tokenization dealt with the word `"Don't"`. `"Don't"` stands for +`"do not"`, so it would be better tokenized as `["Do", "n't"]`. 
This is where things start getting complicated, and +part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a +different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an +input that was tokenized with the same rules that were used to tokenize its training data. + +[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) are two popular +rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like: + +``` +["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +As can be seen space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and +punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined +as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this +tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization +usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735! + +Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which +causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size +greater than 50,000, especially if they are pretrained only on a single language. + +So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? + + + +While character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder +for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent +representation for the letter `"t"` is much harder than learning a context-independent representation for the word +`"today"`. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of +both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword** +tokenization. + +## Subword tokenization + + + +Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller +subwords, but rare words should be decomposed into meaningful subwords. For instance `"annoyingly"` might be +considered a rare word and could be decomposed into `"annoying"` and `"ly"`. Both `"annoying"` and `"ly"` as +stand-alone subwords would appear more frequently while at the same time the meaning of `"annoyingly"` is kept by the +composite meaning of `"annoying"` and `"ly"`. This is especially useful in agglutinative languages such as Turkish, +where you can form (almost) arbitrarily long complex words by stringing together subwords. + +Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful +context-independent representations. In addition, subword tokenization enables the model to process words it has never +seen before, by decomposing them into known subwords. 
For instance, the [`~transformers.BertTokenizer`] tokenizes +`"I have a new GPU!"` as follows: + +```py +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> tokenizer.tokenize("I have a new GPU!") +["i", "have", "a", "new", "gp", "##u", "!"] +``` + +Because we are considering the uncased model, the sentence was lowercased first. We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer's vocabulary, but the word `"gpu"` is not. Consequently, the +tokenizer splits `"gpu"` into known subwords: `["gp" and "##u"]`. `"##"` means that the rest of the token should +be attached to the previous one, without space (for decoding or reversal of the tokenization). + +As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously exemplary text as follows: + +```py +>>> from transformers import XLNetTokenizer + +>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased") +>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.") +["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] +``` + +We'll get back to the meaning of those `"▁"` when we look at [SentencePiece](#sentencepiece). As one can see, +the rare word `"Transformers"` has been split into the more frequent subwords `"Transform"` and `"ers"`. + +Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization +algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained +on. + + + +### Byte-Pair Encoding (BPE) + +Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et +al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into +words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm), +[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses +spaCy and ftfy, to count the frequency of each word in the training corpus. + +After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the +training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set +of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until +the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to +define before training the tokenizer. + +As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been +determined: + +``` +("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) +``` + +Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the +base vocabulary, we obtain: + +``` +("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) +``` + +BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. 
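As a rough sketch, the pair counting can be written in a few lines of Python (illustrative only, not the actual 🤗 Tokenizers implementation; the `word_freqs` dictionary simply restates the example above):

```py
from collections import Counter

# the example words from above, already split into base-vocabulary symbols
word_freqs = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
              ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}

# count how often each pair of adjacent symbols occurs, weighted by word frequency
pair_freqs = Counter()
for symbols, freq in word_freqs.items():
    for pair in zip(symbols, symbols[1:]):
        pair_freqs[pair] += freq

print(pair_freqs[("h", "u")], pair_freqs.most_common(1))
# 15 [(('u', 'g'), 20)]
```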
In +the example above `"h"` followed by `"u"` is present _10 + 5 = 15_ times (10 times in the 10 occurrences of +`"hug"`, 5 times in the 5 occurrences of `"hugs"`). However, the most frequent symbol pair is `"u"` followed by +`"g"`, occurring _10 + 5 + 5 = 20_ times in total. Thus, the first merge rule the tokenizer learns is to group all +`"u"` symbols followed by a `"g"` symbol together. Next, `"ug"` is added to the vocabulary. The set of words then +becomes + +``` +("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) +``` + +BPE then identifies the next most common symbol pair. It's `"u"` followed by `"n"`, which occurs 16 times. `"u"`, +`"n"` is merged to `"un"` and added to the vocabulary. The next most frequent symbol pair is `"h"` followed by +`"ug"`, occurring 15 times. Again the pair is merged and `"hug"` can be added to the vocabulary. + +At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words +is represented as + +``` +("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) +``` + +Assuming, that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied +to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, +the word `"bug"` would be tokenized to `["b", "ug"]` but `"mug"` would be tokenized as `["", "ug"]` since +the symbol `"m"` is not in the base vocabulary. In general, single letters such as `"m"` are not replaced by the +`""` symbol because the training data usually includes at least one occurrence of each letter, but it is likely +to happen for very special characters like emojis. + +As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter +to choose. For instance [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters +and chose to stop training after 40,000 merges. + +#### Byte-level BPE + +A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are +considered as base characters. To have a better base vocabulary, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses bytes +as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that +every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2's +tokenizer can tokenize every text without the need for the symbol. [GPT-2](model_doc/gpt) has a vocabulary +size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned +with 50,000 merges. + + + +### WordPiece + +WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean +Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to +BPE. WordPiece first initializes the vocabulary to include every character present in the training data and +progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent +symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary. 
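The next paragraph explains this criterion in words; as a formula it is usually stated as `score = freq(pair) / (freq(first) * freq(second))`, which can be sketched as follows (a hypothetical helper for illustration, not the actual training code):

```py
def wordpiece_score(pair_freq, first_freq, second_freq):
    # how much more often the pair occurs than its two symbols would suggest on their own
    return pair_freq / (first_freq * second_freq)

# frequencies taken from the BPE example above ("u" occurs 36 times, "g" 20 times, "s" 5 times)
wordpiece_score(20, 36, 20)  # ("u", "g"): most frequent pair                 -> ~0.028
wordpiece_score(5, 20, 5)    # ("g", "s"): rarer, but scores higher here      -> 0.05
```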
+ +So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is +equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by +its second symbol is the greatest among all symbol pairs. *E.g.* `"u"`, followed by `"g"` would have only been +merged if the probability of `"ug"` divided by `"u"`, `"g"` would have been greater than for any other symbol +pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it _loses_ by merging two symbols +to ensure it's _worth it_. + + + +### Unigram + +Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural Network Translation +Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). In contrast to BPE or +WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each +symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and +the most common substrings. Unigram is not used directly for any of the models in the transformers, but it's used in +conjunction with [SentencePiece](#sentencepiece). + +At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training +data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm +computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then +removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, *i.e.* those +symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has +reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized. + +Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of +tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary: + +``` +["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"], +``` + +`"hugs"` could be tokenized both as `["hug", "s"]`, `["h", "ug", "s"]` or `["h", "u", "g", "s"]`. So which one +to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that +the probability of each possible tokenization can be computed after training. The algorithm simply picks the most +likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their +probabilities. + +Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of +the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is +defined as \\(S(x_{i})\\), then the overall loss is defined as + +$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$ + + + +### SentencePiece + +All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to +separate words. However, not all languages use spaces to separate words. One possible solution is to use language +specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer). 
+To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and +detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input +as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram +algorithm to construct the appropriate vocabulary. + +The [`XLNetTokenizer`] uses SentencePiece for example, which is also why in the example earlier the +`"▁"` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be +concatenated and `"▁"` is replaced by a space. + +All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models +using SentencePiece are [ALBERT](model_doc/albert), [XLNet](model_doc/xlnet), [Marian](model_doc/marian), and [T5](model_doc/t5). From 5272c801f5790a4841681e1ec764c533249e4f32 Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 11:21:48 -0800 Subject: [PATCH 03/12] fix link to Transformes XL in en/ --- docs/source/en/tokenizer_summary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md index fbe8f6f7a17743..838e6f3659188f 100644 --- a/docs/source/en/tokenizer_summary.md +++ b/docs/source/en/tokenizer_summary.md @@ -73,7 +73,7 @@ As can be seen space and punctuation tokenization, as well as rule-based tokeniz punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization -usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735! +usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transfo-xl.md) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735! Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size From 71b856123d9150de2681895784e58205e9f24c78 Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 11:40:02 -0800 Subject: [PATCH 04/12] translate until Subword tokenization section --- docs/source/es/tokenizer_summary.md | 58 ++++++++--------------------- 1 file changed, 15 insertions(+), 43 deletions(-) diff --git a/docs/source/es/tokenizer_summary.md b/docs/source/es/tokenizer_summary.md index fbe8f6f7a17743..2d9b2cc0bd5b96 100644 --- a/docs/source/es/tokenizer_summary.md +++ b/docs/source/es/tokenizer_summary.md @@ -14,83 +14,55 @@ rendered properly in your Markdown viewer. --> -# Summary of the tokenizers +# Descripción general de los tokenizadores [[open-in-colab]] -On this page, we will have a closer look at tokenization. +En esta página, veremos más de cerca la tokenización. -As we saw in [the preprocessing tutorial](preprocessing), tokenizing a text is splitting it into words or -subwords, which then are converted to ids through a look-up table. 
Converting words or subwords to ids is -straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text). -More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding -(BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples -of which tokenizer type is used by which model. +Como vimos en [el tutorial de preprocessamiento](preprocessing), tokenizar un texto es dividirlo en palabras o subpalabras, que luego se convierten en indices o ids a través de una tabla de búsqueda. Convertir palabras o subpalabras en ids es sencillo, así que en esta descripción general, nos centraremos en dividir un texto en palabras o subpalabras (es decir, tokenizar un texto). Más específicamente, examinaremos los tres principales tipos de tokenizadores utilizados en 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece) y [SentencePiece](#sentencepiece), y mostraremos ejemplos de qué tipo de tokenizador se utiliza en cada modelo. -Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer -type was used by the pretrained model. For instance, if we look at [`BertTokenizer`], we can see -that the model uses [WordPiece](#wordpiece). +Ten en cuenta que en las páginas de los modelos, puedes ver la documentación del tokenizador asociado para saber qué tipo de tokenizador se utilizó en el modelo preentrenado. Por ejemplo, si miramos [BertTokenizer](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer), podemos ver que dicho modelo utiliza [WordPiece](#wordpiece). -## Introduction +## Introducción -Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so. -For instance, let's look at the sentence `"Don't you love 🤗 Transformers? We sure do."` +Dividir un texto en trozos más pequeños es más difícil de lo que parece, y hay múltiples formas de hacerlo. Por ejemplo, veamos la oración `"Don't you love 🤗 Transformers? We sure do."` -A simple way of tokenizing this text is to split it by spaces, which would give: +Una forma sencilla de tokenizar este texto es dividirlo por espacios, lo que daría: ``` ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] ``` -This is a sensible first step, but if we look at the tokens `"Transformers?"` and `"do."`, we notice that the -punctuation is attached to the words `"Transformer"` and `"do"`, which is suboptimal. We should take the -punctuation into account so that a model does not have to learn a different representation of a word and every possible -punctuation symbol that could follow it, which would explode the number of representations the model has to learn. -Taking punctuation into account, tokenizing our exemplary text would give: +Este es un primer paso sensato, pero si miramos los tokens `"Transformers?"` y `"do."`, notamos que las puntuaciones están unidas a las palabras `"Transformer"` y `"do"`, lo que es subóptimo. Deberíamos tener en cuenta la puntuación para que un modelo no tenga que aprender una representación diferente de una palabra y cada posible símbolo de puntuación que podría seguirle, lo que explotaría el número de representaciones que el modelo tiene que aprender. 
Teniendo en cuenta la puntuación, tokenizar nuestro texto daría: ``` ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] ``` -Better. However, it is disadvantageous, how the tokenization dealt with the word `"Don't"`. `"Don't"` stands for -`"do not"`, so it would be better tokenized as `["Do", "n't"]`. This is where things start getting complicated, and -part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a -different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an -input that was tokenized with the same rules that were used to tokenize its training data. +Mejor. Sin embargo, es desventajoso cómo la tokenización trata la palabra `"Don't"`. `"Don't"` significa `"do not"`, así que sería mejor tokenizada como `["Do", "n't"]`. Aquí es donde las cosas comienzan a complicarse, y es la razon por la que cada modelo tiene su propio tipo de tokenizador. Dependiendo de las reglas que apliquemos para tokenizar un texto, se genera una salida tokenizada diferente para el mismo texto. Un modelo preentrenado solo se desempeña correctamente si se le proporciona una entrada que fue tokenizada con las mismas reglas que se utilizaron para tokenizar sus datos de entrenamiento. -[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) are two popular -rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like: +[spaCy](https://spacy.io/) y [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) son dos tokenizadores basados en reglas populares. Al aplicarlos en nuestro ejemplo, *spaCy* y *Moses* generarían algo como: ``` ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] ``` -As can be seen space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and -punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined -as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this -tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization -usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735! +Como se puede ver, aquí se utiliza tokenización de espacio y puntuación, así como tokenización basada en reglas. La tokenización de espacio y puntuación y la tokenización basada en reglas son ambos ejemplos de tokenización de palabras, que se define de manera simple como dividir oraciones en palabras. Aunque es la forma más intuitiva de dividir textos en trozos más pequeños, este método de tokenización puede generar problemas para corpus de texto masivos. En este caso, la tokenización de espacio y puntuación suele generar un vocabulario muy grande (el conjunto de todas las palabras y tokens únicos utilizados). *Ej.*, [Transformer XL](https://huggingface.co/docs/transformers/main/en/model_doc/transfo-xl) utiliza tokenización de espacio y puntuación, lo que resulta en un tamaño de vocabulario de 267,735. -Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which -causes both an increased memory and time complexity. 
In general, transformers models rarely have a vocabulary size -greater than 50,000, especially if they are pretrained only on a single language. +Un tamaño de vocabulario tan grande fuerza al modelo a tener una matriz de embeddings enormemente grande como capa de entrada y salida, lo que causa un aumento tanto en la complejidad de memoria como en la complejidad de tiempo. En general, los modelos de transformadores rara vez tienen un tamaño de vocabulario mayor que 50,000, especialmente si están preentrenados solo en un idioma. -So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? +Entonces, si la simple tokenización de espacios y puntuación es insatisfactoria, ¿por qué no tokenizar simplemente en caracteres? -While character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder -for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent -representation for the letter `"t"` is much harder than learning a context-independent representation for the word -`"today"`. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of -both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword** -tokenization. +Aunque la tokenización de caracteres es muy simple y reduciría significativamente la complejidad de memoria y tiempo, hace que sea mucho más difícil para el modelo aprender representaciones de entrada significativas. *Ej.* aprender una representación independiente del contexto para la letra `"t"` es mucho más difícil que aprender una representación independiente del contexto para la palabra `"today"`. Por lo tanto, la tokenización de caracteres suele acompañarse de una pérdida de rendimiento. Así que para obtener lo mejor de ambos mundos, los modelos de transformadores utilizan un híbrido entre la tokenización de nivel de palabra y de nivel de carácter llamada **tokenización de subpalabras**. -## Subword tokenization +## Tokenización de subpalabras From 28eba0a0e0e12d90c918e275b5f2644addb75d7d Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 14:53:29 -0800 Subject: [PATCH 05/12] fix GPT link in en/ --- docs/source/en/tokenizer_summary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md index 838e6f3659188f..31ed176e4ba730 100644 --- a/docs/source/en/tokenizer_summary.md +++ b/docs/source/en/tokenizer_summary.md @@ -142,7 +142,7 @@ on. Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm), -[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses +[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/openai-gpt) which uses spaCy and ftfy, to count the frequency of each word in the training corpus. 
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the From 3f63289a3f59e8a8c41f1897f90b4f134a5c7035 Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 15:04:08 -0800 Subject: [PATCH 06/12] fix other GPT link in en/ --- docs/source/en/tokenizer_summary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md index 31ed176e4ba730..f2583441ccd885 100644 --- a/docs/source/en/tokenizer_summary.md +++ b/docs/source/en/tokenizer_summary.md @@ -195,7 +195,7 @@ the symbol `"m"` is not in the base vocabulary. In general, single letters such to happen for very special characters like emojis. As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter -to choose. For instance [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters +to choose. For instance [GPT](model_doc/openai-gpt) has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges. #### Byte-level BPE From dbb6295b2a5a836d4898cbd4e262dc3a258b91bd Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 16:13:15 -0800 Subject: [PATCH 07/12] fix typo in en/ --- docs/source/en/tokenizer_summary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md index f2583441ccd885..90baaebc8fed95 100644 --- a/docs/source/en/tokenizer_summary.md +++ b/docs/source/en/tokenizer_summary.md @@ -268,7 +268,7 @@ $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right ) All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language -specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer). +specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer. To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram From 90cacaadcc35ce4443e7e3d3c700fa9443d1e37f Mon Sep 17 00:00:00 2001 From: Aaron Jimenez Date: Thu, 30 May 2024 16:13:43 -0800 Subject: [PATCH 08/12] translate the doc --- docs/source/es/tokenizer_summary.md | 133 ++++++---------------------- 1 file changed, 27 insertions(+), 106 deletions(-) diff --git a/docs/source/es/tokenizer_summary.md b/docs/source/es/tokenizer_summary.md index 2d9b2cc0bd5b96..e93ddea7a16e27 100644 --- a/docs/source/es/tokenizer_summary.md +++ b/docs/source/es/tokenizer_summary.md @@ -66,17 +66,9 @@ Aunque la tokenización de caracteres es muy simple y reduciría significativame -Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller -subwords, but rare words should be decomposed into meaningful subwords. For instance `"annoyingly"` might be -considered a rare word and could be decomposed into `"annoying"` and `"ly"`. 
Both `"annoying"` and `"ly"` as -stand-alone subwords would appear more frequently while at the same time the meaning of `"annoyingly"` is kept by the -composite meaning of `"annoying"` and `"ly"`. This is especially useful in agglutinative languages such as Turkish, -where you can form (almost) arbitrarily long complex words by stringing together subwords. - -Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful -context-independent representations. In addition, subword tokenization enables the model to process words it has never -seen before, by decomposing them into known subwords. For instance, the [`~transformers.BertTokenizer`] tokenizes -`"I have a new GPU!"` as follows: +Los algoritmos de tokenización de subpalabras se basan en el principio de que las palabras frecuentemente utilizadas no deberían dividirse en subpalabras más pequeñas, pero las palabras raras deberían descomponerse en subpalabras significativas. Por ejemplo, `"annoyingly"` podría considerarse una palabra rara y descomponerse en `"annoying"` y `"ly"`. Ambas `"annoying"` y `"ly"` como subpalabras independientes aparecerían con más frecuencia al mismo tiempo que se mantiene el significado de `"annoyingly"` por el significado compuesto de `"annoying"` y `"ly"`. Esto es especialmente útil en lenguas aglutinantes como el turco, donde puedes formar palabras complejas (casi) arbitrariamente largas concatenando subpalabras. + +La tokenización de subpalabras permite al modelo tener un tamaño de vocabulario razonable mientras puede aprender representaciones contextuales independientes significativas. Además, la tokenización de subpalabras permite al modelo procesar palabras que nunca ha visto antes, descomponiéndolas en subpalabras conocidas. Por ejemplo, el tokenizador [`~transformers.BertTokenizer`](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer) tokeniza `"I have a new GPU!"` de la siguiente manera: ```py >>> from transformers import BertTokenizer @@ -86,11 +78,9 @@ seen before, by decomposing them into known subwords. For instance, the [`~trans ["i", "have", "a", "new", "gp", "##u", "!"] ``` -Because we are considering the uncased model, the sentence was lowercased first. We can see that the words `["i", "have", "a", "new"]` are present in the tokenizer's vocabulary, but the word `"gpu"` is not. Consequently, the -tokenizer splits `"gpu"` into known subwords: `["gp" and "##u"]`. `"##"` means that the rest of the token should -be attached to the previous one, without space (for decoding or reversal of the tokenization). +Debido a que estamos considerando el modelo sin mayúsculas, la oración se convirtió a minúsculas primero. Podemos ver que las palabras `["i", "have", "a", "new"]` están presentes en el vocabulario del tokenizador, pero la palabra `"gpu"` no. En consecuencia, el tokenizador divide `"gpu"` en subpalabras conocidas: `["gp" y "##u"]`. `"##"` significa que el resto del token debería adjuntarse al anterior, sin espacio (para decodificar o revertir la tokenización). 
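Como ilustración mínima (asumiendo el `tokenizer` del bloque anterior), el método `convert_tokens_to_string` invierte este proceso y vuelve a unir las subpalabras marcadas con `"##"`:

```py
>>> tokenizer.convert_tokens_to_string(["i", "have", "a", "new", "gp", "##u", "!"])
'i have a new gpu !'
```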
-As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously exemplary text as follows: +Como otro ejemplo, el tokenizador [`~transformers.XLNetTokenizer`](https://huggingface.co/docs/transformers/en/model_doc/xlnet#transformers.XLNetTokenizer) tokeniza nuestro texto de ejemplo anterior de la siguiente manera: ```py >>> from transformers import XLNetTokenizer @@ -100,137 +90,77 @@ As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously ex ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] ``` -We'll get back to the meaning of those `"▁"` when we look at [SentencePiece](#sentencepiece). As one can see, -the rare word `"Transformers"` has been split into the more frequent subwords `"Transform"` and `"ers"`. +Hablaremos del significado de esos `"▁"` cuando veamos [SentencePiece](#sentencepiece). Como se puede ver, la palabra rara `"Transformers"` se ha dividido en las subpalabras más frecuentes `"Transform"` y `"ers"`. -Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization -algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained -on. +Ahora, veamos cómo funcionan los diferentes algoritmos de tokenización de subpalabras. Ten en cuenta que todos esos algoritmos de tokenización se basan en alguna forma de entrenamiento que usualmente se realiza en el corpus en el que se entrenará el modelo correspondiente. ### Byte-Pair Encoding (BPE) -Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et -al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into -words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm), -[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses -spaCy and ftfy, to count the frequency of each word in the training corpus. +La Codificación por Pares de Bytes (BPE por sus siglas en inglés) fue introducida en [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE se basa en un pre-tokenizador que divide los datos de entrenamiento en palabras. La pre-tokenización puede ser tan simple como la tokenización por espacio, por ejemplo, [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2), [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta). La pre-tokenización más avanzada incluye la tokenización basada en reglas, por ejemplo, [XLM](https://huggingface.co/docs/transformers/en/model_doc/xlm), [FlauBERT](model_doc/flaubert) que utiliza Moses para la mayoría de los idiomas, o [GPT](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt) que utiliza spaCy y ftfy, para contar la frecuencia de cada palabra en el corpus de entrenamiento. -After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the -training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set -of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until -the vocabulary has attained the desired vocabulary size. 
Note that the desired vocabulary size is a hyperparameter to -define before training the tokenizer. +Después de la pre-tokenización, se ha creado un conjunto de palabras únicas y ha determinado la frecuencia con la que cada palabra apareció en los datos de entrenamiento. A continuación, BPE crea un vocabulario base que consiste en todos los símbolos que aparecen en el conjunto de palabras únicas y aprende reglas de fusión para formar un nuevo símbolo a partir de dos símbolos del vocabulario base. Lo hace hasta que el vocabulario ha alcanzado el tamaño de vocabulario deseado. Tenga en cuenta que el tamaño de vocabulario deseado es un hiperparámetro que se debe definir antes de entrenar el tokenizador. -As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been -determined: +Por ejemplo, supongamos que después de la pre-tokenización, se ha determinado el siguiente conjunto de palabras, incluyendo su frecuencia: ``` ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) ``` -Consequently, the base vocabulary is `["b", "g", "h", "n", "p", "s", "u"]`. Splitting all words into symbols of the -base vocabulary, we obtain: +En consecuencia, el vocabulario base es `["b", "g", "h", "n", "p", "s", "u"]`. Dividiendo todas las palabras en símbolos del vocabulario base, obtenemos: ``` ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) ``` -BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In -the example above `"h"` followed by `"u"` is present _10 + 5 = 15_ times (10 times in the 10 occurrences of -`"hug"`, 5 times in the 5 occurrences of `"hugs"`). However, the most frequent symbol pair is `"u"` followed by -`"g"`, occurring _10 + 5 + 5 = 20_ times in total. Thus, the first merge rule the tokenizer learns is to group all -`"u"` symbols followed by a `"g"` symbol together. Next, `"ug"` is added to the vocabulary. The set of words then -becomes +Luego, BPE cuenta la frecuencia de cada par de símbolos posible y selecciona el par de símbolos que ocurre con más frecuencia. En el ejemplo anterior, `"h"` seguido de `"u"` está presente _10 + 5 = 15_ veces (10 veces en las 10 ocurrencias de `"hug"`, 5 veces en las 5 ocurrencias de `"hugs"`). Sin embargo, el par de símbolos más frecuente es `"u"` seguido de `"g"`, que ocurre _10 + 5 + 5 = 20_ veces en total. Por lo tanto, la primera regla de fusión que aprende el tokenizador es agrupar todos los símbolos `"u"` seguidos de un símbolo `"g"` juntos. A continuación, `"ug"` se agrega al vocabulario. El conjunto de palabras entonces se convierte en ``` ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) ``` -BPE then identifies the next most common symbol pair. It's `"u"` followed by `"n"`, which occurs 16 times. `"u"`, -`"n"` is merged to `"un"` and added to the vocabulary. The next most frequent symbol pair is `"h"` followed by -`"ug"`, occurring 15 times. Again the pair is merged and `"hug"` can be added to the vocabulary. +Seguidamente, BPE identifica el próximo par de símbolos más común. Es `"u"` seguido de `"n"`, que ocurre 16 veces. `"u"`, `"n"` se fusionan en `"un"` y se agregan al vocabulario. El próximo par de símbolos más frecuente es `"h"` seguido de `"ug"`, que ocurre 15 veces. De nuevo, el par se fusiona y `"hug"` se puede agregar al vocabulario. 
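Este bucle de conteo y fusión se puede esbozar en unas pocas líneas de Python (solo como ilustración, no es la implementación real de 🤗 Tokenizers):

```py
from collections import Counter

# las palabras del ejemplo, divididas en símbolos del vocabulario base
palabras = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12,
            ("b", "u", "n"): 4, ("h", "u", "g", "s"): 5}
vocabulario = ["b", "g", "h", "n", "p", "s", "u"]

for _ in range(3):  # el número de fusiones es el hiperparámetro mencionado antes
    # contar la frecuencia de cada par de símbolos adyacentes
    pares = Counter()
    for simbolos, frecuencia in palabras.items():
        for par in zip(simbolos, simbolos[1:]):
            pares[par] += frecuencia
    mejor = pares.most_common(1)[0][0]
    vocabulario.append("".join(mejor))
    # aplicar la fusión aprendida a todas las palabras
    nuevas_palabras = {}
    for simbolos, frecuencia in palabras.items():
        unidos, i = [], 0
        while i < len(simbolos):
            if simbolos[i:i + 2] == mejor:
                unidos.append("".join(mejor))
                i += 2
            else:
                unidos.append(simbolos[i])
                i += 1
        nuevas_palabras[tuple(unidos)] = frecuencia
    palabras = nuevas_palabras

print(vocabulario)
# ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
```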
-At this stage, the vocabulary is `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` and our set of unique words -is represented as +En este momento, el vocabulario es `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` y nuestro conjunto de palabras únicas se representa como: ``` ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) ``` -Assuming, that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied -to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, -the word `"bug"` would be tokenized to `["b", "ug"]` but `"mug"` would be tokenized as `["", "ug"]` since -the symbol `"m"` is not in the base vocabulary. In general, single letters such as `"m"` are not replaced by the -`""` symbol because the training data usually includes at least one occurrence of each letter, but it is likely -to happen for very special characters like emojis. +Suponiendo que el entrenamiento por Byte-Pair Encoding se detuviera en este punto, las reglas de combinación aprendidas se aplicarían entonces a nuevas palabras (siempre que esas nuevas palabras no incluyan símbolos que no estuvieran en el vocabulario base). Por ejemplo, la palabra `"bug"` se tokenizaría como `["b", "ug"]`, pero `"mug"` se tokenizaría como `["", "ug"]` ya que el símbolo `"m"` no está en el vocabulario base. En general, las letras individuales como `"m"` no se reemplazan por el símbolo `""` porque los datos de entrenamiento usualmente incluyen al menos una ocurrencia de cada letra, pero es probable que suceda para caracteres especiales como los emojis. -As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter -to choose. For instance [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters -and chose to stop training after 40,000 merges. +Como se mencionó anteriormente, el tamaño del vocabulario, es decir, el tamaño del vocabulario base + el número de combinaciones, es un hiperparámetro que se debe elegir. Por ejemplo, [GPT](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt) tiene un tamaño de vocabulario de 40,478 ya que tienen 478 caracteres base y eligieron detener el entrenamiento después de 40,000 combinaciones. #### Byte-level BPE -A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are -considered as base characters. To have a better base vocabulary, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses bytes -as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that -every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2's -tokenizer can tokenize every text without the need for the symbol. [GPT-2](model_doc/gpt) has a vocabulary -size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned -with 50,000 merges. +Un vocabulario base que incluya todos los caracteres base posibles puede ser bastante extenso si, por ejemplo, se consideran todos los caracteres unicode como caracteres base. 
Para tener un vocabulario base mejor, [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) utiliza bytes como vocabulario base, lo que es un truco astuto para forzar el vocabulario base a ser de tamaño 256 mientras se asegura de que cada carácter base esté incluido en el vocabulario. Con algunas reglas adicionales para tratar con la puntuación, el tokenizador de GPT2 puede tokenizar cualquier texto sin la necesidad del símbolo ``. [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2) tiene un tamaño de vocabulario de 50,257, lo que corresponde a los 256 tokens base de bytes, un token especial de fin de texto y los símbolos aprendidos con 50,000 combinaciones. ### WordPiece -WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean -Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to -BPE. WordPiece first initializes the vocabulary to include every character present in the training data and -progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent -symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary. +WordPiece es el algoritmo de tokenización de subpalabras utilizado por [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert), [DistilBERT](https://huggingface.co/docs/transformers/main/en/model_doc/distilbert) y [Electra](https://huggingface.co/docs/transformers/main/en/model_doc/electra). El algoritmo fue descrito en [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) y es muy similar a BPE. WordPiece inicializa el vocabulario para incluir cada carácter presente en los datos de entrenamiento y aprende progresivamente un número determinado de reglas de fusión. A diferencia de BPE, WordPiece no elige el par de símbolos más frecuente, sino el que maximiza la probabilidad de los datos de entrenamiento una vez agregado al vocabulario. -So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is -equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by -its second symbol is the greatest among all symbol pairs. *E.g.* `"u"`, followed by `"g"` would have only been -merged if the probability of `"ug"` divided by `"u"`, `"g"` would have been greater than for any other symbol -pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it _loses_ by merging two symbols -to ensure it's _worth it_. +¿Qué significa esto exactamente? Refiriéndonos al ejemplo anterior, maximizar la probabilidad de los datos de entrenamiento es equivalente a encontrar el par de símbolos cuya probabilidad dividida entre las probabilidades de su primer símbolo seguido de su segundo símbolo es la mayor entre todos los pares de símbolos. *Ej.* `"u"` seguido de `"g"` solo habría sido combinado si la probabilidad de `"ug"` dividida entre `"u"` y `"g"` habría sido mayor que para cualquier otro par de símbolos. Intuitivamente, WordPiece es ligeramente diferente a BPE en que evalúa lo que _pierde_ al fusionar dos símbolos para asegurarse de que _valga la pena_. 
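Un esbozo mínimo de esta puntuación, usando las frecuencias del ejemplo de BPE anterior (solo ilustrativo, no es el código real de entrenamiento):

```py
def puntuacion_wordpiece(frec_par, frec_primero, frec_segundo):
    # puntuación = frecuencia del par / (frecuencia del primer símbolo * frecuencia del segundo)
    return frec_par / (frec_primero * frec_segundo)

# frecuencias del ejemplo de BPE: "u" aparece 36 veces, "g" 20 veces y "s" 5 veces
puntuacion_wordpiece(20, 36, 20)  # ("u", "g"): el par más frecuente             -> ~0.028
puntuacion_wordpiece(5, 20, 5)    # ("g", "s"): menos frecuente, mayor puntaje   -> 0.05
```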
### Unigram -Unigram is a subword tokenization algorithm introduced in [Subword Regularization: Improving Neural Network Translation -Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). In contrast to BPE or -WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each -symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and -the most common substrings. Unigram is not used directly for any of the models in the transformers, but it's used in -conjunction with [SentencePiece](#sentencepiece). +Unigram es un algoritmo de tokenización de subpalabras introducido en [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf). A diferencia de BPE o WordPiece, Unigram inicializa su vocabulario base con un gran número de símbolos y progresivamente recorta cada símbolo para obtener un vocabulario más pequeño. El vocabulario base podría corresponder, por ejemplo, a todas las palabras pre-tokenizadas y las subcadenas más comunes. Unigram no se utiliza directamente para ninguno de los modelos transformers, pero se utiliza en conjunto con [SentencePiece](#sentencepiece). -At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training -data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm -computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then -removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, *i.e.* those -symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has -reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized. +En cada paso de entrenamiento, el algoritmo Unigram define una pérdida (a menudo definida como la probabilidad logarítmica) sobre los datos de entrenamiento dados el vocabulario actual y un modelo de lenguaje unigram. Luego, para cada símbolo en el vocabulario, el algoritmo calcula cuánto aumentaría la pérdida general si el símbolo se eliminara del vocabulario. Luego, Unigram elimina un porcentaje `p` de los símbolos cuyo aumento de pérdida es el más bajo (siendo `p` generalmente 10% o 20%), es decir, aquellos símbolos que menos afectan la pérdida general sobre los datos de entrenamiento. Este proceso se repite hasta que el vocabulario haya alcanzado el tamaño deseado. El algoritmo Unigram siempre mantiene los caracteres base para que cualquier palabra pueda ser tokenizada. -Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of -tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary: +Debido a que Unigram no se basa en reglas de combinación (en contraste con BPE y WordPiece), el algoritmo tiene varias formas de tokenizar nuevo texto después del entrenamiento. Por ejemplo, si un tokenizador Unigram entrenado exhibe el vocabulario: ``` ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"], ``` -`"hugs"` could be tokenized both as `["hug", "s"]`, `["h", "ug", "s"]` or `["h", "u", "g", "s"]`. So which one -to choose? 
Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that -the probability of each possible tokenization can be computed after training. The algorithm simply picks the most -likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their -probabilities. +`"hugs"` podría ser tokenizado tanto como `["hug", "s"]`, `["h", "ug", "s"]` o `["h", "u", "g", "s"]`. ¿Cuál elegir? Unigram guarda la probabilidad de cada token en el corpus de entrenamiento junto con el vocabulario, para que la probabilidad de que cada posible tokenización pueda ser computada después del entrenamiento. El algoritmo simplemente elige la tokenización más probable en la práctica, pero también ofrece la posibilidad de muestrear una posible tokenización según sus probabilidades. -Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of -the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is -defined as \\(S(x_{i})\\), then the overall loss is defined as +Esas probabilidades están definidas por la pérdida en la que se entrena el tokenizador. Suponiendo que los datos de entrenamiento constan de las palabras \\(x_{1}, \dots, x_{N}\\) y que el conjunto de todas las posibles tokenizaciones para una palabra \\(x_{i}\\) se define como \\(S(x_{i})\\), entonces la pérdida general se define como: $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$ @@ -238,17 +168,8 @@ $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right ) ### SentencePiece -All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to -separate words. However, not all languages use spaces to separate words. One possible solution is to use language -specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer). -To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and -detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input -as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram -algorithm to construct the appropriate vocabulary. +Todos los algoritmos de tokenización descritos hasta ahora tienen el mismo problema: se asume que el texto de entrada utiliza espacios para separar palabras. Sin embargo, no todos los idiomas utilizan espacios para separar palabras. Una posible solución es utilizar pre-tokenizadores específicos del idioma, *ej.* [XLM](https://huggingface.co/docs/transformers/en/model_doc/xlm) utiliza un pre-tokenizador específico para chino, japonés y tailandés. Para resolver este problema de manera más general, [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) trata el texto de entrada como una corriente de entrada bruta, por lo que incluye el espacio en el conjunto de caracteres para utilizar. Luego utiliza el algoritmo BPE o unigram para construir el vocabulario apropiado. -The [`XLNetTokenizer`] uses SentencePiece for example, which is also why in the example earlier the -`"▁"` character was included in the vocabulary. 

From 2ddebe850de4c5116aa8410f80ff1d6e36774927 Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Thu, 30 May 2024 21:13:35 -0800
Subject: [PATCH 09/12] run make fixup

From 15a6a82ffcbf3c7ac5178104ec32278b51108dbe Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Thu, 30 May 2024 21:19:34 -0800
Subject: [PATCH 10/12] Remove .md in Transformer XL link

---
 docs/source/en/tokenizer_summary.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/tokenizer_summary.md b/docs/source/en/tokenizer_summary.md
index 90baaebc8fed95..c5f12dd20d20ed 100644
--- a/docs/source/en/tokenizer_summary.md
+++ b/docs/source/en/tokenizer_summary.md
@@ -73,7 +73,7 @@ As can be seen space and punctuation tokenization, as well as rule-based tokeniz
 punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
-usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transfo-xl.md) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
+usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transfo-xl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
 Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
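To put rough numbers on the vocabulary-size argument in the hunk above, a back-of-the-envelope calculation with an assumed hidden size of 1024 (the hidden size is purely illustrative):

```py
# Back-of-the-envelope illustration of the embedding-matrix size argument above,
# assuming a purely hypothetical hidden size of 1024.
hidden_size = 1024
for vocab_size in (267_735, 50_000):
    params = vocab_size * hidden_size
    print(f"{vocab_size} tokens -> {params:,} input embedding parameters")
# 267735 tokens -> 274,160,640 input embedding parameters (~274M)
# 50000 tokens -> 51,200,000 input embedding parameters (~51M)
```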

From fbc8fb7cf27a55b5185992eafebb58f73039e0aa Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Fri, 31 May 2024 14:10:06 -0800
Subject: [PATCH 11/12] fix some link issues in es/

---
 docs/source/es/tokenizer_summary.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/es/tokenizer_summary.md b/docs/source/es/tokenizer_summary.md
index e93ddea7a16e27..7ff2fd44d7d41e 100644
--- a/docs/source/es/tokenizer_summary.md
+++ b/docs/source/es/tokenizer_summary.md
@@ -68,7 +68,7 @@ Aunque la tokenización de caracteres es muy simple y reduciría significativame
 Los algoritmos de tokenización de subpalabras se basan en el principio de que las palabras frecuentemente utilizadas no deberían dividirse en subpalabras más pequeñas, pero las palabras raras deberían descomponerse en subpalabras significativas. Por ejemplo, `"annoyingly"` podría considerarse una palabra rara y descomponerse en `"annoying"` y `"ly"`. Ambas `"annoying"` y `"ly"` como subpalabras independientes aparecerían con más frecuencia al mismo tiempo que se mantiene el significado de `"annoyingly"` por el significado compuesto de `"annoying"` y `"ly"`. Esto es especialmente útil en lenguas aglutinantes como el turco, donde puedes formar palabras complejas (casi) arbitrariamente largas concatenando subpalabras.
-La tokenización de subpalabras permite al modelo tener un tamaño de vocabulario razonable mientras puede aprender representaciones contextuales independientes significativas. Además, la tokenización de subpalabras permite al modelo procesar palabras que nunca ha visto antes, descomponiéndolas en subpalabras conocidas. Por ejemplo, el tokenizador [`~transformers.BertTokenizer`](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer) tokeniza `"I have a new GPU!"` de la siguiente manera:
+La tokenización de subpalabras permite al modelo tener un tamaño de vocabulario razonable mientras puede aprender representaciones contextuales independientes significativas. Además, la tokenización de subpalabras permite al modelo procesar palabras que nunca ha visto antes, descomponiéndolas en subpalabras conocidas. Por ejemplo, el tokenizador [BertTokenizer](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer) tokeniza `"I have a new GPU!"` de la siguiente manera:
 ```py
 >>> from transformers import BertTokenizer
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
 >>> tokenizer.tokenize("I have a new GPU!")
 ["i", "have", "a", "new", "gp", "##u", "!"]
 ```
@@ -80,7 +80,7 @@ La tokenización de subpalabras permite al modelo tener un tamaño de vocabulari
 Debido a que estamos considerando el modelo sin mayúsculas, la oración se convirtió a minúsculas primero. Podemos ver que las palabras `["i", "have", "a", "new"]` están presentes en el vocabulario del tokenizador, pero la palabra `"gpu"` no. En consecuencia, el tokenizador divide `"gpu"` en subpalabras conocidas: `["gp" y "##u"]`. `"##"` significa que el resto del token debería adjuntarse al anterior, sin espacio (para decodificar o revertir la tokenización).
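The context paragraph above explains the `"##"` continuation marker; the sketch below reverses the WordPiece-style split from the `BertTokenizer` example, as a simplification rather than the library's actual detokenization code:

```py
# Simplified sketch of undoing WordPiece-style "##" continuation markers, using the
# BertTokenizer output from the hunk above. This is an illustration, not the exact
# logic the library uses to detokenize.
tokens = ["i", "have", "a", "new", "gp", "##u", "!"]

words = []
for token in tokens:
    if token.startswith("##") and words:
        words[-1] += token[2:]   # glue the continuation piece to the previous token
    else:
        words.append(token)      # otherwise start a new word

print(" ".join(words))  # i have a new gpu !
```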
-Como otro ejemplo, el tokenizador [`~transformers.XLNetTokenizer`](https://huggingface.co/docs/transformers/en/model_doc/xlnet#transformers.XLNetTokenizer) tokeniza nuestro texto de ejemplo anterior de la siguiente manera:
+Como otro ejemplo, el tokenizador [XLNetTokenizer](https://huggingface.co/docs/transformers/en/model_doc/xlnet#transformers.XLNetTokenizer) tokeniza nuestro texto de ejemplo anterior de la siguiente manera:
 ```py
 >>> from transformers import XLNetTokenizer
 >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
 >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
 ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
 ```
@@ -98,7 +98,7 @@ Ahora, veamos cómo funcionan los diferentes algoritmos de tokenización de subp
 ### Byte-Pair Encoding (BPE)
-La Codificación por Pares de Bytes (BPE por sus siglas en inglés) fue introducida en [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE se basa en un pre-tokenizador que divide los datos de entrenamiento en palabras. La pre-tokenización puede ser tan simple como la tokenización por espacio, por ejemplo, [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2), [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta). La pre-tokenización más avanzada incluye la tokenización basada en reglas, por ejemplo, [XLM](https://huggingface.co/docs/transformers/en/model_doc/xlm), [FlauBERT](model_doc/flaubert) que utiliza Moses para la mayoría de los idiomas, o [GPT](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt) que utiliza spaCy y ftfy, para contar la frecuencia de cada palabra en el corpus de entrenamiento.
+La Codificación por Pares de Bytes (BPE por sus siglas en inglés) fue introducida en [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE se basa en un pre-tokenizador que divide los datos de entrenamiento en palabras. La pre-tokenización puede ser tan simple como la tokenización por espacio, por ejemplo, [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2), [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta). La pre-tokenización más avanzada incluye la tokenización basada en reglas, por ejemplo, [XLM](https://huggingface.co/docs/transformers/en/model_doc/xlm), [FlauBERT](https://huggingface.co/docs/transformers/en/model_doc/flaubert) que utiliza Moses para la mayoría de los idiomas, o [GPT](https://huggingface.co/docs/transformers/en/model_doc/openai-gpt) que utiliza spaCy y ftfy, para contar la frecuencia de cada palabra en el corpus de entrenamiento.
 Después de la pre-tokenización, se ha creado un conjunto de palabras únicas y ha determinado la frecuencia con la que cada palabra apareció en los datos de entrenamiento. A continuación, BPE crea un vocabulario base que consiste en todos los símbolos que aparecen en el conjunto de palabras únicas y aprende reglas de fusión para formar un nuevo símbolo a partir de dos símbolos del vocabulario base. Lo hace hasta que el vocabulario ha alcanzado el tamaño de vocabulario deseado. Tenga en cuenta que el tamaño de vocabulario deseado es un hiperparámetro que se debe definir antes de entrenar el tokenizador.
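The BPE paragraphs above describe learning merge rules from word frequencies; the sketch below runs that loop on a hypothetical word-frequency table (the words, counts, and number of merges are all invented for illustration):

```py
# Compact, illustrative BPE training loop (not the library implementation): count
# symbol pairs over a hypothetical word-frequency table and repeatedly merge the
# most frequent pair, recording the learned merge rules.
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # made-up corpus counts
splits = {word: list(word) for word in word_freqs}  # start from the character base vocabulary

def most_frequent_pair():
    pairs = Counter()
    for word, symbols in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += word_freqs[word]
    return pairs.most_common(1)[0][0] if pairs else None

merges = []
for _ in range(3):  # the number of merges fixes the final vocabulary size (a hyperparameter)
    pair = most_frequent_pair()
    if pair is None:
        break
    merges.append(pair)
    for symbols in splits.values():
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = ["".join(pair)]
            else:
                i += 1

print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits)  # {'hug': ['hug'], 'pug': ['p', 'ug'], 'pun': ['p', 'un'], 'bun': ['b', 'un'], 'hugs': ['hug', 's']}
```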

From 01e5f82e82058efae0446d216b5853dba99338a4 Mon Sep 17 00:00:00 2001
From: Aaron Jimenez
Date: Mon, 3 Jun 2024 16:07:36 -0800
Subject: [PATCH 12/12] fix typo

---
 docs/source/es/tokenizer_summary.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/es/tokenizer_summary.md b/docs/source/es/tokenizer_summary.md
index 7ff2fd44d7d41e..c4c8ee1783b251 100644
--- a/docs/source/es/tokenizer_summary.md
+++ b/docs/source/es/tokenizer_summary.md
@@ -22,7 +22,7 @@ En esta página, veremos más de cerca la tokenización.
-Como vimos en [el tutorial de preprocessamiento](preprocessing), tokenizar un texto es dividirlo en palabras o subpalabras, que luego se convierten en indices o ids a través de una tabla de búsqueda. Convertir palabras o subpalabras en ids es sencillo, así que en esta descripción general, nos centraremos en dividir un texto en palabras o subpalabras (es decir, tokenizar un texto). Más específicamente, examinaremos los tres principales tipos de tokenizadores utilizados en 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece) y [SentencePiece](#sentencepiece), y mostraremos ejemplos de qué tipo de tokenizador se utiliza en cada modelo.
+Como vimos en [el tutorial de preprocesamiento](preprocessing), tokenizar un texto es dividirlo en palabras o subpalabras, que luego se convierten en indices o ids a través de una tabla de búsqueda. Convertir palabras o subpalabras en ids es sencillo, así que en esta descripción general, nos centraremos en dividir un texto en palabras o subpalabras (es decir, tokenizar un texto). Más específicamente, examinaremos los tres principales tipos de tokenizadores utilizados en 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece) y [SentencePiece](#sentencepiece), y mostraremos ejemplos de qué tipo de tokenizador se utiliza en cada modelo.
 Ten en cuenta que en las páginas de los modelos, puedes ver la documentación del tokenizador asociado para saber qué tipo de tokenizador se utilizó en el modelo preentrenado. Por ejemplo, si miramos [BertTokenizer](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer), podemos ver que dicho modelo utiliza [WordPiece](#wordpiece).