
Commit

Merge branch 'main' into main
muellerzr authored Oct 25, 2024
2 parents 1eab209 + 1d06379 commit cdbb3d3
Showing 44 changed files with 156 additions and 991 deletions.
6 changes: 3 additions & 3 deletions docs/source/en/agents_advanced.md
@@ -66,10 +66,10 @@ manager_agent.run("Who is the CEO of Hugging Face?")

Let's take again the tool example from the main documentation, for which we had implemented a `tool` decorator.

If you need to add variation, like custom attributes for your too, you can build your tool following the fine-grained method: building a class that inherits from the [`Tool`] superclass.
If you need to add variation, like custom attributes for your tool, you can build your tool following the fine-grained method: building a class that inherits from the [`Tool`] superclass.

The custom tool needs:
- An attribute `name`, which corresponds to the name of the tool itself. The name usually describes what the tool does. Since the code returns the model with the most downloads for a task, let's name is `model_download_counter`.
- An attribute `name`, which corresponds to the name of the tool itself. The name usually describes what the tool does. Since the code returns the model with the most downloads for a task, let's name it `model_download_counter`.
- An attribute `description`, which is used to populate the agent's system prompt.
- An `inputs` attribute, which is a dictionary with keys `"type"` and `"description"`. It contains information that helps the Python interpreter make educated choices about the input.
- An `output_type` attribute, which specifies the output type.
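
As a minimal sketch of such a class (the `HFModelDownloadsTool` name, the `forward` body, and the `huggingface_hub.list_models` call are illustrative assumptions rather than the exact snippet from the main documentation):

```python
from huggingface_hub import list_models
from transformers import Tool


class HFModelDownloadsTool(Tool):
    name = "model_download_counter"
    description = (
        "Returns the checkpoint with the most downloads on the Hugging Face Hub "
        "for a given task."
    )
    # One entry per input, each described by a "type" and a "description".
    inputs = {
        "task": {
            "type": "string",
            "description": "the task category, e.g. 'text-classification'",
        }
    }
    output_type = "string"

    def forward(self, task: str) -> str:
        # Illustrative implementation: return the most-downloaded checkpoint for the task.
        model = next(iter(list_models(filter=task, sort="downloads", direction=-1)))
        return model.id
```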
@@ -240,4 +240,4 @@ with gr.Blocks() as demo:

if __name__ == "__main__":
demo.launch()
```
```
4 changes: 1 addition & 3 deletions docs/source/en/internal/generation_utils.md
@@ -428,13 +428,11 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- __call__

[[autodoc]] BayesianDetectorConfig
- __call__

[[autodoc]] BayesianDetectorModel
- __call__
- forward

[[autodoc]] SynthIDTextWatermarkingConfig
- __call__

[[autodoc]] SynthIDTextWatermarkDetector
- __call__
20 changes: 14 additions & 6 deletions src/transformers/generation/configuration_utils.py
@@ -172,7 +172,15 @@ class GenerationConfig(PushToHubMixin):
speed up decoding.
cache_implementation (`str`, *optional*, defaults to `None`):
Name of the cache class that will be instantiated in `generate`, for faster decoding. Possible values are:
{ALL_CACHE_IMPLEMENTATIONS}. We support other cache types, but they must be manually instantiated and
- `"static"`: [`StaticCache`]
- `"offloaded_static"`: [`OffloadedStaticCache`]
- `"sliding_window"`: [`SlidingWindowCache`]
- `"hybrid"`: [`HybridCache`]
- `"mamba"`: [`MambaCache`]
- `"quantized"`: [`QuantizedCache`]
We support other cache types, but they must be manually instantiated and
passed to `generate` through the `past_key_values` argument. See our
[cache documentation](https://huggingface.co/docs/transformers/en/kv_cache) for further information.
cache_config (`CacheConfig` or `dict`, *optional*, defaults to `None`):
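
As a minimal usage sketch of the built-in caches listed above (the checkpoint and prompt are placeholders; any decoder-only model should behave similarly), selecting one only requires passing its string name to `generate`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

inputs = tokenizer("The theory of special relativity states", return_tensors="pt")
# "static" is one of the implementations listed above; see the cache documentation
# for when each option pays off.
outputs = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```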
@@ -1471,8 +1479,8 @@ class SynthIDTextWatermarkingConfig(BaseWatermarkingConfig):
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig
>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b-it')
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b-it')
>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')
>>> # SynthID Text configuration
>>> watermarking_config = SynthIDTextWatermarkingConfig(
@@ -1481,11 +1489,11 @@ class SynthIDTextWatermarkingConfig(BaseWatermarkingConfig):
... )
>>> # Generation with watermarking
>>> tokenized_prompts = tokenizer(["your prompts here"])
>>> tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
>>> output_sequences = model.generate(
... **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True,
... **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
... )
>>> watermarked_text = tokenizer.batch_decode(output_sequences)
>>> watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
```
"""

10 changes: 5 additions & 5 deletions src/transformers/generation/logits_process.py
@@ -2565,8 +2565,8 @@ class SynthIDTextWatermarkLogitsProcessor(LogitsProcessor):
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig
>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b-it')
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b-it')
>>> tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
>>> model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')
>>> # SynthID Text configuration
>>> watermarking_config = SynthIDTextWatermarkingConfig(
@@ -2575,11 +2575,11 @@ class SynthIDTextWatermarkLogitsProcessor(LogitsProcessor):
... )
>>> # Generation with watermarking
>>> tokenized_prompts = tokenizer(["your prompts here"])
>>> tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
>>> output_sequences = model.generate(
... **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True,
... **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
... )
>>> watermarked_text = tokenizer.batch_decode(output_sequences)
>>> watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
```
"""

7 changes: 6 additions & 1 deletion src/transformers/models/llava/modeling_llava.py
@@ -354,7 +354,12 @@ def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, in
(batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
)
image_to_overwrite[batch_indices, text_to_overwrite] = False
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
if left_padding:
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
else:
mask = torch.ones_like(image_to_overwrite, dtype=torch.bool).cumsum(-1) - 1
padding_mask = mask <= new_token_positions[:, -1:].to(target_device)
image_to_overwrite &= padding_mask

if image_to_overwrite.sum() != image_features.shape[:-1].numel():
raise ValueError(
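The new `else` branch handles right-padded batches: instead of counting pad slots from the left, it keeps only positions up to the last merged token, so trailing padding is never treated as an image slot. A small self-contained illustration of that masking step (shapes and values invented for the example):

```python
import torch

# One sequence of length 6 whose last real (non-pad) token sits at index 3,
# i.e. positions 4 and 5 are right padding.
image_to_overwrite = torch.tensor([[True, False, True, True, True, True]])
last_real_position = torch.tensor([[3]])  # stands in for new_token_positions[:, -1:]

# Position index of every slot: 0, 1, ..., 5 (cumsum of a bool tensor promotes to int64)
positions = torch.ones_like(image_to_overwrite, dtype=torch.bool).cumsum(-1) - 1
padding_mask = positions <= last_real_position  # [[True, True, True, True, False, False]]
image_to_overwrite &= padding_mask
print(image_to_overwrite)  # tensor([[ True, False,  True,  True, False, False]])
```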
2 changes: 1 addition & 1 deletion src/transformers/models/mimi/modeling_mimi.py
@@ -1156,7 +1156,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
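The same change from `|=` to `bitwise_or_` recurs in the sliding-window models below (mistral, mixtral, moshi, phi3, phimoe, qwen2, qwen2_moe, qwen2_vl, starcoder2); it spells the in-place boolean OR explicitly. A minimal standalone check of what the call does:

```python
import torch

diagonal_attend_mask = torch.tensor([True, False, False, True])
sliding_attend_mask = torch.tensor([False, False, True, True])

# In-place OR: mutates diagonal_attend_mask rather than binding a new tensor.
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
print(diagonal_attend_mask)  # tensor([ True, False,  True,  True])
```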
2 changes: 1 addition & 1 deletion src/transformers/models/mistral/modeling_mistral.py
@@ -961,7 +961,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/mixtral/modeling_mixtral.py
@@ -1174,7 +1174,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
4 changes: 2 additions & 2 deletions src/transformers/models/moshi/modeling_moshi.py
@@ -1385,7 +1385,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
@@ -1689,7 +1689,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/phi3/modeling_phi3.py
@@ -1136,7 +1136,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/phimoe/modeling_phimoe.py
@@ -1305,7 +1305,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/qwen2/modeling_qwen2.py
@@ -1059,7 +1059,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/qwen2_moe/modeling_qwen2_moe.py
@@ -1239,7 +1239,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
@@ -1321,7 +1321,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
2 changes: 1 addition & 1 deletion src/transformers/models/starcoder2/modeling_starcoder2.py
@@ -1033,7 +1033,7 @@ def _prepare_4d_causal_attention_mask_with_cache_position(
sliding_attend_mask = torch.arange(target_length, device=device) <= (
cache_position.reshape(-1, 1) - config.sliding_window
)
diagonal_attend_mask |= sliding_attend_mask
diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
causal_mask *= diagonal_attend_mask
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
7 changes: 6 additions & 1 deletion src/transformers/models/video_llava/modeling_video_llava.py
@@ -339,7 +339,12 @@ def _merge_input_ids_with_visual_features(
# 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
image_to_overwrite = torch.full((batch_size, max_seq_len), True, dtype=torch.bool, device=inputs_embeds.device)
image_to_overwrite[batch_indices, text_to_overwrite] = False
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
if left_padding:
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
else:
mask = torch.ones_like(image_to_overwrite, dtype=torch.bool).cumsum(-1) - 1
padding_mask = mask <= new_token_positions[:, -1:].to(target_device)
image_to_overwrite &= padding_mask

if image_to_overwrite.sum() != visual_features.shape[:-1].numel():
visual_type = "videos" if num_frames == 8 else "images"
7 changes: 6 additions & 1 deletion src/transformers/models/vipllava/modeling_vipllava.py
@@ -350,7 +350,12 @@ def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, in
(batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
)
image_to_overwrite[batch_indices, text_to_overwrite] = False
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
if left_padding:
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
else:
mask = torch.ones_like(image_to_overwrite, dtype=torch.bool).cumsum(-1) - 1
padding_mask = mask <= new_token_positions[:, -1:].to(target_device)
image_to_overwrite &= padding_mask

if image_to_overwrite.sum() != image_features.shape[:-1].numel():
raise ValueError(
82 changes: 82 additions & 0 deletions tests/generation/test_utils.py
@@ -15,6 +15,7 @@


import copy
import gc
import inspect
import tempfile
import unittest
@@ -33,6 +34,7 @@
require_torch_gpu,
require_torch_multi_accelerator,
require_torch_multi_gpu,
require_torch_sdpa,
slow,
torch_device,
)
@@ -2046,6 +2048,86 @@ def test_inherits_generation_mixin(self):
for model_class in self.all_generative_model_classes:
self.assertTrue("GenerationMixin" in str(model_class.__bases__))

@require_torch_sdpa
@slow
def test_eager_matches_sdpa_generate(self):
max_new_tokens = 30

for model_class in self.all_generative_model_classes:
if not model_class._supports_sdpa:
self.skipTest(f"{model_class.__name__} does not support SDPA")

config, original_inputs_dict = self.prepare_config_and_inputs_for_generate()
inputs_dict = {}
for input_name, input_data in original_inputs_dict.items():
if isinstance(input_data, torch.Tensor) and input_data.dtype in [torch.float32, torch.bfloat16]:
inputs_dict[input_name] = input_data.to(torch.float16)
else:
inputs_dict[input_name] = input_data
main_input = inputs_dict[model_class.main_input_name]

# make sure that all models have enough positions for generation
if hasattr(config, "max_position_embeddings"):
config.max_position_embeddings = max_new_tokens + main_input.shape[1] + 1

model = model_class(config)

with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
del model
gc.collect()

generate_kwargs = {
"max_new_tokens": max_new_tokens,
"do_sample": False,
"return_dict_in_generate": True,
"output_scores": True,
}

model_sdpa = model_class.from_pretrained(
tmpdirname,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
).to(torch_device)
res_sdpa = model_sdpa.generate(**inputs_dict, **generate_kwargs)
del model_sdpa
gc.collect()

model_eager = model_class.from_pretrained(
tmpdirname,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
attn_implementation="eager",
).to(torch_device)
res_eager = model_eager.generate(**inputs_dict, **generate_kwargs)
del model_eager
gc.collect()

# Eager and SDPA are very similar, but not exactly the same. Because we are using random models, this
# test would be flaky if we only checked the sequences. Two situations in which this test passes:
# 1. The sequences are the same
# 2. The sequences are different, but the scores up until the first mismatch are nearly identical
output_matches = res_eager.sequences == res_sdpa.sequences
has_matching_outputs = output_matches.all()
has_matching_scores = None
if not has_matching_outputs:
input_length = main_input.shape[1]
for batch_idx in range(res_eager.sequences.shape[0]):
batch_matches = output_matches[batch_idx]
if batch_matches.all():
continue
first_mismatch_idx = batch_matches.int().argmin() # gets the index of the first False
first_mismatch_idx -= input_length # scores doesn't include data regarding input tokens
sdpa_first_mismatch_scores = res_sdpa.scores[first_mismatch_idx][batch_idx]
eager_first_mismatch_scores = res_eager.scores[first_mismatch_idx][batch_idx]
has_matching_scores = torch.allclose(
sdpa_first_mismatch_scores, eager_first_mismatch_scores, rtol=1e-3, atol=1e-3
)
if not has_matching_scores:
break

self.assertTrue(has_matching_outputs or has_matching_scores)

def _check_outputs(self, output, main_input, config, use_cache=False, num_return_sequences=1):
# we can be sure of the batch size from the main input, but the seq length depends on the model type and whether the input is text/audio/image
# so we infer the actual text seq length from model_tester, the same way as it is done in the `test_modeling_common.py` tests
