add option to skip special tokens #1419

ArthurZucker · 2023-12-18T10:26:10Z

Allow skipping special tokens when encoding

HuggingFaceDocBuilderDev · 2023-12-18T10:31:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker · 2023-12-18T15:07:03Z

This works as expected for now:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.tokenize("<|endoftext|>")
['<|endoftext|>']

>>> tokenizer._tokenizer.encode_special_tokens = True
>>> tokenizer.tokenize("<|endoftext|>")
['<', '|', 'end', 'of', 'text', '|', '>']

The goal is to support passing this as a kwarg, as the slow tokenizer already does.
That way you can both save it with the tokenizer and activate it in a `__call__`.
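To make the intended behaviour concrete, here is a minimal pure-Python sketch of a flag that can be stored on the tokenizer or overridden per call. `ToyTokenizer` and its regex splitter are hypothetical stand-ins written for illustration. They are not the `tokenizers` implementation, and the split is cruder than GPT-2's BPE.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in, not the tokenizers implementation."""

    def __init__(self, special_tokens, encode_special_tokens=False):
        self.special_tokens = list(special_tokens)
        # Stored default: the kind of value that could be saved with the tokenizer.
        self.encode_special_tokens = encode_special_tokens

    def _split_plain(self, text):
        # Crude splitter: runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)

    def tokenize(self, text, encode_special_tokens=None):
        # A per-call kwarg overrides the stored default when it is given.
        if encode_special_tokens is None:
            encode_special_tokens = self.encode_special_tokens
        if encode_special_tokens or not self.special_tokens:
            # Special tokens are treated as ordinary text and split like it.
            return self._split_plain(text)
        # Otherwise each special token is kept as one atomic piece.
        pattern = "(" + "|".join(re.escape(t) for t in self.special_tokens) + ")"
        pieces = []
        for chunk in re.split(pattern, text):
            if chunk in self.special_tokens:
                pieces.append(chunk)
            else:
                pieces.extend(self._split_plain(chunk))
        return pieces

    __call__ = tokenize


tok = ToyTokenizer(["<|endoftext|>"])
atomic = tok("hi <|endoftext|>")
split = tok("hi <|endoftext|>", encode_special_tokens=True)
```

With the default, `atomic` keeps `<|endoftext|>` as a single piece. With the kwarg set, `split` breaks it into ordinary sub-pieces, and the stored default is left unchanged.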

ArthurZucker · 2023-12-18T15:33:45Z

TODO:

- Benchmark to check that this does not slow things down too much: good to go
- Add tests
- Open a PR in transformers as a follow-up


ArthurZucker · 2024-01-19T07:50:04Z

Before merging I just want to add a getter, and make sure we can just set it with tokenizer.encode_special_tokens = True.
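As an illustration of the getter/setter idea, the sketch below forwards a wrapper attribute to a backend object, so that plain assignment on the wrapper updates the underlying flag. `Backend` and `Wrapper` are hypothetical names. The actual bindings expose the property from Rust, which this Python sketch only imitates.

```python
class Backend:
    """Hypothetical backend object that owns the real flag."""

    def __init__(self):
        self.encode_special_tokens = False


class Wrapper:
    """Hypothetical user-facing object that delegates to the backend."""

    def __init__(self, backend):
        self._tokenizer = backend

    @property
    def encode_special_tokens(self):
        # Getter: read the current value from the backend.
        return self._tokenizer.encode_special_tokens

    @encode_special_tokens.setter
    def encode_special_tokens(self, value):
        # Setter: coerce to bool and write through to the backend.
        self._tokenizer.encode_special_tokens = bool(value)


backend = Backend()
tokenizer = Wrapper(backend)
tokenizer.encode_special_tokens = True
```

After the assignment, reading the flag from either `tokenizer` or `backend` returns the same value, so there is one source of truth.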

ArthurZucker · 2024-01-19T11:24:56Z

PR is not on the correct branch lol

add option to skip special tokens

ed302a8

ArthurZucker mentioned this pull request Dec 18, 2023

How can we ignore special tokens when encoding text #1368

Closed

nits

a581a04

add api dummy for now

524acfe

ArthurZucker mentioned this pull request Jan 3, 2024

How to split special token in encode? #1391

Closed

Narsil and others added 6 commits January 18, 2024 16:47

Fmt.

262e9d2

Fix fmt.

820e88b

Fix the stub.

11e4ffc

add a test

33415e0

Merge branch 'encode-special-tokens' of github.com:huggingface/tokeni…

842eced

…zers into encode-special-tokens

add a test in python

7fb3c18

ArthurZucker added 7 commits January 19, 2024 09:48

style it

7ac2fab

nits

456e515

add getter and setters

5460db0

stub

bf1dbe6

update python test

af40ae2

fmt

ebbcc8e

last nit

f7beeba

ArthurZucker closed this Jan 19, 2024

ArthurZucker deleted the encode-special-tokens branch January 19, 2024 11:43
