return pytorch tensors like in transformers? #1578

Closed
PaulLerner opened this issue Jul 23, 2024 · 5 comments

Comments

@PaulLerner

Hi,

Sorry in advance because I feel like I'm missing something here.
The Tokenizer from tokenizers seems to have all of the same features as transformers.PreTrainedTokenizer except for one: return_tensors="pt"

>>> batch = tokenizer.encode_batch(["foo","barbaz"], add_special_tokens=True)
>>> batch
[Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

So, of course, I could convert them myself to a PyTorch Tensor from Encoding.ids like so, but:

  • this adds a lot of boilerplate
  • it's not efficient to go through a Python List[int]
>>> torch.tensor([item.ids for item in batch])
tensor([[ 1, 23,  9,  9,  2,  0,  0,  0],
        [ 1, 22,  8,  7, 22,  8, 30,  2]])

Or should I use tokenizers only for training the tokenizer and then switch to transformers.PreTrainedTokenizer for inference?

Best,

Paul

(In case you're wondering, this is a character-level tokenizer; that's why the sequences are so long in my example.)

@ArthurZucker
Collaborator

ArthurZucker commented Jul 23, 2024

In general, tokenizers pairs really well with transformers. But if this feature is requested, I don't mind adding support for it! It would have to be a Python-only "layer", though, in the sense that I don't think there is a Rust torch type.
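
For concreteness, a rough sketch of what such a Python-only layer could look like, built on the existing encode_batch API (the helper name and the padding call are illustrative assumptions, not part of the library):

import torch
from tokenizers import Tokenizer

def encode_batch_pt(tokenizer: Tokenizer, texts):
    # Hypothetical helper: pad so every Encoding in the batch has the same
    # length, then stack ids and attention masks into PyTorch tensors.
    tokenizer.enable_padding()
    batch = tokenizer.encode_batch(texts, add_special_tokens=True)
    return {
        "input_ids": torch.tensor([e.ids for e in batch], dtype=torch.long),
        "attention_mask": torch.tensor([e.attention_mask for e in batch], dtype=torch.long),
    }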

@PaulLerner
Author

Thanks for the quick answer! So I guess I'll just use PreTrainedTokenizerFast.
(N.B. don't use from_pretrained but init with tokenizer_file: https://huggingface.co/docs/transformers/v4.42.0/en/fast_tokenizers)
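
A minimal sketch of that approach, assuming the trained tokenizer was saved to a file (the path and the [PAD] token below are assumptions; adapt them to your tokenizer):

from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # assumed path to the saved tokenizers file
    pad_token="[PAD]",                # assumed: adjust to the padding token your vocab uses
)
batch = fast_tokenizer(["foo", "barbaz"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)       # e.g. torch.Size([2, 8])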

@vandrw

vandrw commented Jul 23, 2024

@ArthurZucker There are Rust bindings for Torch, as seen here, but they are a bit more finicky to use. There are currently some issues with creating a Python package that uses them, since they require setting up various flags tied to a specific PyTorch version.

Another way to avoid going through Python lists, while supporting the latest PyTorch version, is to return NumPy arrays and then call torch.from_numpy() in Python. In my experience, this is faster than creating tensors from regular Python lists.
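
A user-side sketch of this idea with the current API (it still crosses a Python list once, since Encoding exposes ids as a list, but torch.from_numpy avoids a second copy; assumes padding is enabled so all rows have equal length):

import numpy as np
import torch

# batch is the List[Encoding] returned by tokenizer.encode_batch(...)
ids = np.array([enc.ids for enc in batch], dtype=np.int64)
input_ids = torch.from_numpy(ids)  # shares memory with the numpy array, no extra copy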

@ArthurZucker
Collaborator

We do support input sequences that are numpy arrays of strings; it's a matter of converting into a single encoding, where you keep the offsets and batch / same for tokens, etc.

@ArthurZucker
Collaborator

Currently more focused on getting tokenizers to tiktoken speed (which is soon going to be the case, see #1560), but contributions are welcome!
