[RFC] Model-based Tokenizer for Text Chunking #794
Comments
The current approach is using an estimation formula. But the output of the chunking process should only rely on the mapping specification of the target field; this can be anything, it is the user's choice. I would say that specifying the model should only be used to determine the real chunk length, relying on the model-specific tokenizer rules, so that the chunk length can be determined against a real token limit. Regarding the options, I would prefer option 2, because it works with every deployed model (I use LLM models that are not in the preconfigured OpenSearch list) and frees me from specifying an extra tokenizer. I currently can't think of a use case where someone would want to chunk according to a specific model and not use it afterward. Question: what about models that are not uploaded directly into OpenSearch but accessed through an underlying connector (as in the LLM case)? Are the tokenization rules known there as well?
If OpenSearch is using a remote model downstream, option 2 is no longer valid unless we deploy another kind of connector in ml-commons. For remote models, can we choose option 3 so that users can enable the tokenizer with the tokenizer file?
Although it is a valid estimation, the word tokenizer in OpenSearch may still produce longer texts than the token limit of the text embedding model. We are considering implementing a model-based tokenizer so that there would be neither information loss nor extra chunks.
The tokenization inside the chunker is only an internal process, right? It is not necessary to specify this tokenizer on the target field. I'm a bit confused by the pros and cons, where one con is the output of the tokenizer, but this output is never seen by the user, or am I wrong?
You are right. The tokenization is orthogonal to the target field. Ideally, the user should be able to specify any tokenizer on any existing target field. |
Are you referring to this con? It is the reformatting problem. Suppose we are using the fixed token length algorithm and the token limit is set to 1. The ideal output should be either ...
@yuye-aws Why assign this to me? I don't have plans to implement this feature currently.
Because you have done a lot of investigation on model-based tokenizers. Assigning this to you does not mean that you need to work on this feature immediately.
Since OpenSearch 2.13, the fixed token length algorithm has been available in the text chunking processor. With the fixed token length algorithm, users can specify the token limit for each chunked passage. A common use case for the text chunking processor is to append a text embedding processor; with the text chunking processor, users can circumvent the information loss caused by truncation in downstream text embedding models.
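For illustration, here is a minimal sketch of that appended-processor use case: an ingest pipeline that chains a text_chunking processor with a text_embedding processor. The host, pipeline name, field names, and model id are placeholders, and the exact parameter set should be checked against the text chunking processor documentation for your OpenSearch version.

```python
import requests

# Sketch of an ingest pipeline: chunk long text with the fixed token length
# algorithm, then embed each chunk with a deployed text embedding model.
# Host, pipeline name, field names, and model_id are placeholders.
pipeline = {
    "description": "Chunk long documents, then embed each chunk",
    "processors": [
        {
            "text_chunking": {
                "algorithm": {
                    "fixed_token_length": {
                        "token_limit": 384,      # tokens per chunk (word tokens as of 2.15)
                        "overlap_rate": 0.2,     # overlap between adjacent chunks
                        "tokenizer": "standard"  # word-level tokenizer
                    }
                },
                "field_map": {"body": "body_chunks"}
            }
        },
        {
            "text_embedding": {
                "model_id": "<deployed-embedding-model-id>",
                "field_map": {"body_chunks": "body_chunks_embedding"}
            }
        }
    ]
}

resp = requests.put(
    "http://localhost:9200/_ingest/pipeline/chunk-then-embed",
    json=pipeline,
)
print(resp.status_code, resp.json())
```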
As of OpenSearch 2.15, the fixed token length algorithm only supports word tokenizers. Text embedding models truncate long texts that exceed the token limit, as measured by their own tokenizers. Given the disparity between word tokenizers and model-based tokenizers, it is hard for users to assign a perfect value to the token_limit parameter. We are initiating this RFC to solicit feedback from the community on whether and how to implement a model-based tokenizer for the fixed token length algorithm.
Introduction
Tokenization is the process of segmenting a string into a list of individual tokens. Prior to text embedding, language models perform tokenization on the input texts. Each language model has its own model-based tokenizer.
The tokenization results vary across different tokenizers. We showcase the difference between word tokenizers and model-based tokenizers with a simple example: the same input string is tokenized with the standard tokenizer and with the tokenizer from the model sentence-transformers/msmarco-distilbert-base-tas-b, where [CLS] indicates the beginning of a sentence and [SEP] splits sentences. The tokens returned by the two tokenizers are quite different: the standard tokenizer returns 12 tokens, while the model-based tokenizer returns 20 tokens. (A sketch of how such a comparison can be reproduced follows the model list below.)

In our first release, we can start with the tokenizers for OpenSearch-provided pretrained models. These models usually do not share the same vocabulary corpus and tokenizer. For disambiguation, we need to support a dedicated tokenizer for each of the following model categories:
Sentence transformers
Sparse encoding models
Cross-encoder models
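The comparison above can be reproduced roughly with the Hugging Face transformers library. This is a sketch, not the RFC's original example: the sample sentence is made up, and a simple regex split stands in for OpenSearch's standard tokenizer, so the exact token counts will differ.

```python
import re
from transformers import AutoTokenizer

text = "Text chunking splits long documents into passages before embedding."  # made-up sample

# Rough stand-in for OpenSearch's standard (word) tokenizer.
word_tokens = re.findall(r"\w+", text)

# Model-based tokenizer of the embedding model mentioned above.
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)
model_token_ids = tokenizer.encode(text)  # includes [CLS] and [SEP]
model_tokens = tokenizer.convert_ids_to_tokens(model_token_ids)

print(len(word_tokens), word_tokens)
print(len(model_tokens), model_tokens)
```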
Pros and cons
Here are the pros and cons for model-based tokenizers:
Pros
Cons
API
There are three options for using a model-based tokenizer in the fixed token length algorithm. Please note that we can support more than one of them; for example, we can implement one option in the first release and support the others later.
Option 1
Specify tokenizer with pretrained model name.
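A hypothetical processor configuration under this option might look like the sketch below. Since accepting a pretrained model name for the tokenizer parameter is exactly what this option proposes, the value format shown is an assumption for illustration.

```python
# Option 1 (sketch): reference the tokenizer by pretrained model name.
# The value format for "tokenizer" is an assumption for illustration.
chunking_processor = {
    "text_chunking": {
        "algorithm": {
            "fixed_token_length": {
                "token_limit": 512,
                "tokenizer": "sentence-transformers/msmarco-distilbert-base-tas-b"
            }
        },
        "field_map": {"body": "body_chunks"}
    }
}
```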
Pros
Cons
Option 2
After deploying a text embedding model, users can assign its model id as the tokenizer.
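Under this option the processor would reference a deployed model by id. The parameter name below (tokenizer_model_id) is a placeholder for illustration, not an existing setting.

```python
# Option 2 (sketch): reuse the tokenizer of an already deployed embedding model.
# "tokenizer_model_id" is a hypothetical parameter name.
chunking_processor = {
    "text_chunking": {
        "algorithm": {
            "fixed_token_length": {
                "token_limit": 512,
                "tokenizer_model_id": "<deployed-embedding-model-id>"
            }
        },
        "field_map": {"body": "body_chunks"}
    }
}
```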
Pros
Cons
Option 3
Unlike text embedding models, a tokenizer only needs files like tokenizer.json, tokenizer_config.json, and vocab.txt. Following the behavior of registering models in the ml-commons plugin, users can register their tokenizer without the model weights.
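A sketch of how this could look, modeled on the existing ml-commons model registration flow; the TOKENIZER function name, the artifact URL, and the tokenizer_id parameter are assumptions for illustration, not an existing API.

```python
import requests

# Option 3 (sketch): register a tokenizer-only artifact (tokenizer.json,
# tokenizer_config.json, vocab.txt) without model weights, then reference it.
# The endpoint shape follows the ml-commons register API; the "TOKENIZER"
# function name and the artifact URL are illustrative assumptions.
register_body = {
    "name": "msmarco-distilbert-tokenizer",
    "function_name": "TOKENIZER",
    "url": "https://example.com/artifacts/msmarco-distilbert-tokenizer.zip",
}
resp = requests.post(
    "http://localhost:9200/_plugins/_ml/models/_register",
    json=register_body,
)
# In ml-commons, registration is asynchronous and returns a task id that must
# be polled for the resulting id; this sketch skips that step.
tokenizer_id = "<registered-tokenizer-id>"

# Reference the registered tokenizer from the chunking processor.
# "tokenizer_id" is a hypothetical parameter name.
chunking_processor = {
    "text_chunking": {
        "algorithm": {
            "fixed_token_length": {
                "token_limit": 512,
                "tokenizer_id": tokenizer_id
            }
        },
        "field_map": {"body": "body_chunks"}
    }
}
```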
Pros
Cons
Open questions