return pytorch tensors like in transformers? #1578

Closed
PaulLerner opened this issue Jul 23, 2024 · 5 comments

Comments

@PaulLerner

Hi,

Sorry in advance because I feel like I'm missing something here.
The Tokenizer from tokenizers seems to have all of the same features as transformers.PreTrainedTokenizer except for one: return_tensors="pt"

>>> batch = tokenizer.encode_batch(["foo","barbaz"], add_special_tokens=True)
>>> batch
[Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

So, of course, I could convert them myself to a PyTorch Tensor from Encoding.ids like so, but:

  • this adds a lot of boilerplate
  • it's not efficient to go through a Python List[int]
>>> torch.tensor([item.ids for item in batch])
tensor([[ 1, 23,  9,  9,  2,  0,  0,  0],
        [ 1, 22,  8,  7, 22,  8, 30,  2]])

Or should I use tokenizers only for training the tokenizer and then switch to transformers.PreTrainedTokenizer for inference?

Best,

Paul

(In case you're wondering, this is a character-level tokenizer; that's why the sequences are so long in my example.)

@ArthurZucker
Collaborator

ArthurZucker commented Jul 23, 2024

In general, tokenizers pairs really well with transformers. But if this feature is requested, I don't mind adding support for it! It would have to be a Python-only "layer", though, in the sense that I don't think there is a Rust torch type.
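
For concreteness, a rough sketch of what such a Python-only layer could look like, built on the existing encode_batch API (the helper name and the padding call are illustrative assumptions, not part of the library):

import torch
from tokenizers import Tokenizer

def encode_batch_pt(tokenizer: Tokenizer, texts):
    # Hypothetical helper: pad so every Encoding in the batch has the same
    # length, then stack ids and attention masks into PyTorch tensors.
    tokenizer.enable_padding()
    batch = tokenizer.encode_batch(texts, add_special_tokens=True)
    return {
        "input_ids": torch.tensor([e.ids for e in batch], dtype=torch.long),
        "attention_mask": torch.tensor([e.attention_mask for e in batch], dtype=torch.long),
    }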

@PaulLerner
Author

Thanks for the quick answer! So I guess I'll just use PreTrainedTokenizerFast.
(N.B. don't use from_pretrained but init with tokenizer_file: https://huggingface.co/docs/transformers/v4.42.0/en/fast_tokenizers)
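
A minimal sketch of that approach, assuming the trained tokenizer was saved to a file (the path and the [PAD] token below are assumptions; adapt them to your tokenizer):

from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # assumed path to the saved tokenizers file
    pad_token="[PAD]",                # assumed: adjust to the padding token your vocab uses
)
batch = fast_tokenizer(["foo", "barbaz"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)       # e.g. torch.Size([2, 8])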

@vandrw

vandrw commented Jul 23, 2024

@ArthurZucker There are Rust bindings for Torch, as seen here, but they are a bit more finicky to use. There are currently some issues with creating a Python package that uses them, since they require setting up various flags tied to a specific PyTorch version.

Another way to avoid going through Python lists, while supporting the latest PyTorch version, is to return NumPy arrays and then call torch.from_numpy() in Python. In my experience, this is faster than creating tensors from regular Python lists.
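
A user-side sketch of this idea with the current API (it still crosses a Python list once, since Encoding exposes ids as a list, but torch.from_numpy avoids a second copy; assumes padding is enabled so all rows have equal length):

import numpy as np
import torch

# batch is the List[Encoding] returned by tokenizer.encode_batch(...)
ids = np.array([enc.ids for enc in batch], dtype=np.int64)
input_ids = torch.from_numpy(ids)  # shares memory with the numpy array, no extra copy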

@ArthurZucker
Collaborator

We do support input sequences that are numpy arrays of strings; it's a matter of converting into a single encoding, where you keep the offsets and batch / same for tokens, etc.

@ArthurZucker
Collaborator

Currently more focused on getting tokenizers to tiktoken speed (which is soon going to be the case, see #1560), but contributions are welcome!
