Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour (huggingface#33390)

* added doc explaining the behaviour regarding token timestamps and previous tokens

* copied the changes to the fast tokenizer

---------

Co-authored-by: Bruno Hays <[email protected]>
bruno-hays and Bruno Hays authored Sep 10, 2024
1 parent 6ed2b10 commit dfee4f2
Showing 2 changed files with 8 additions and 4 deletions.
6 changes: 4 additions & 2 deletions src/transformers/models/whisper/tokenization_whisper.py
@@ -673,13 +673,15 @@ def decode(
     token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
         List of tokenized input ids. Can be obtained using the `__call__` method.
     skip_special_tokens (`bool`, *optional*, defaults to `False`):
-        Whether or not to remove special tokens in the decoding.
+        Whether or not to remove special tokens in the decoding. Will remove the previous tokens (pre-prompt)
+        if present.
     clean_up_tokenization_spaces (`bool`, *optional*):
         Whether or not to clean up the tokenization spaces. If `None`, will default to
         `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
     output_offsets (`bool`, *optional*, defaults to `False`):
         Whether or not to output the offsets of the tokens. This should only be set if the model predicted
-        timestamps.
+        timestamps. If there are previous tokens (pre-prompt) to decode, they will only appear in the decoded
+        text if they contain timestamp tokens.
     time_precision (`float`, *optional*, defaults to 0.02):
         The time ratio to convert from token to time.
     decode_with_timestamps (`bool`, *optional*, defaults to `False`):
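Not part of the commit: a minimal sketch of the `skip_special_tokens` behaviour documented above. The `openai/whisper-tiny` checkpoint and the prompt/transcription strings are illustrative assumptions, and the outputs shown in the comments are approximate.

# A minimal sketch (not part of the commit) of the documented pre-prompt stripping.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Simulate a generation that starts with previous tokens (a pre-prompt): the prompt ids
# begin with <|startofprev|>, followed by the usual decoder prefix and the transcription.
prompt_ids = tokenizer.get_prompt_ids("Mr. Quilter", return_tensors="np").tolist()
prefix_ids = tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
)
text_ids = tokenizer(" Hello world.", add_special_tokens=False).input_ids
token_ids = prompt_ids + prefix_ids + text_ids + [tokenizer.eos_token_id]

# skip_special_tokens=True also strips the pre-prompt, leaving only the transcription.
print(tokenizer.decode(token_ids, skip_special_tokens=True))
# roughly: " Hello world."

# skip_special_tokens=False keeps the pre-prompt and the special tokens.
print(tokenizer.decode(token_ids, skip_special_tokens=False))
# roughly: "<|startofprev|> Mr. Quilter<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Hello world.<|endoftext|>"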
6 changes: 4 additions & 2 deletions src/transformers/models/whisper/tokenization_whisper_fast.py
@@ -319,13 +319,15 @@ def decode(
     token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
         List of tokenized input ids. Can be obtained using the `__call__` method.
     skip_special_tokens (`bool`, *optional*, defaults to `False`):
-        Whether or not to remove special tokens in the decoding.
+        Whether or not to remove special tokens in the decoding. Will remove the previous tokens (pre-prompt)
+        if present.
     clean_up_tokenization_spaces (`bool`, *optional*):
         Whether or not to clean up the tokenization spaces. If `None`, will default to
         `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
     output_offsets (`bool`, *optional*, defaults to `False`):
         Whether or not to output the offsets of the tokens. This should only be set if the model predicted
-        timestamps.
+        timestamps. If there are previous tokens (pre-prompt) to decode, they will only appear in the decoded
+        text if they contain timestamp tokens.
     time_precision (`float`, *optional*, defaults to 0.02):
         The time ratio to convert from token to time.
     decode_with_timestamps (`bool`, *optional*, defaults to `False`):
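Not part of the commit: a similar sketch of the `output_offsets` behaviour with the fast tokenizer. The checkpoint, the hand-built timestamp ids, and the expected offsets are illustrative assumptions.

# A minimal sketch (not part of the commit) of output_offsets with the fast tokenizer.
from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# In the Whisper vocabulary the timestamp tokens (<|0.00|>, <|0.02|>, ...) follow
# directly after <|notimestamps|>, spaced by time_precision (0.02 s by default).
timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1  # id of <|0.00|>
prefix_ids = tokenizer.convert_tokens_to_ids(["<|startoftranscript|>", "<|en|>", "<|transcribe|>"])
text_ids = tokenizer(" Hello world.", add_special_tokens=False).input_ids
# <|0.00|> ... segment ... <|1.20|> (1.20 s / 0.02 s = 60 steps), then end of text.
token_ids = prefix_ids + [timestamp_begin] + text_ids + [timestamp_begin + 60, tokenizer.eos_token_id]

out = tokenizer.decode(token_ids, output_offsets=True, skip_special_tokens=True, time_precision=0.02)
print(out["text"])     # the decoded transcription
print(out["offsets"])  # roughly: [{"text": " Hello world.", "timestamp": (0.0, 1.2)}]
# As noted in the docstring above, a pre-prompt that contains no timestamp tokens
# would not show up in the decoded output here.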
