Commit

Merge branch 'main' into change-ci

ArthurZucker authored May 2, 2024
2 parents 727e9b6 + 5cf3e6b commit 30f712d
Showing 122 changed files with 1,977 additions and 2,599 deletions.
@@ -4,6 +4,11 @@ on:
pull_request:
paths:
- "src/transformers/models/*/modeling_*.py"
- "tests/models/*/test_*.py"

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

env:
HF_HOME: /mnt/cache
@@ -20,31 +25,46 @@ env:
CUDA_VISIBLE_DEVICES: 0,1

jobs:
check_for_new_model:
find_models_to_run:
runs-on: ubuntu-22.04
name: Check if a PR is a new model PR
name: Find models to run slow tests
# Triggered only if the required label `run-slow` is added
if: ${{ contains(github.event.pull_request.labels.*.name, 'run-slow') }}
outputs:
new_model: ${{ steps.check_new_model.outputs.new_model }}
models: ${{ steps.models_to_run.outputs.models }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: "0"
ref: ${{ github.event.pull_request.head.sha }}

- name: Get commit message
run: |
echo "commit_message=$(git show -s --format=%s)" >> $GITHUB_ENV
- name: Check if there is a new model
id: check_new_model
- name: Get models to run slow tests
run: |
echo "${{ env.commit_message }}"
python -m pip install GitPython
echo "new_model=$(python utils/check_if_new_model_added.py | tail -n 1)" >> $GITHUB_OUTPUT
python utils/pr_slow_ci_models.py --commit_message "${{ env.commit_message }}" | tee output.txt
echo "models=$(tail -n 1 output.txt)" >> $GITHUB_ENV
- name: Models to run slow tests
id: models_to_run
run: |
echo "${{ env.models }}"
echo "models=${{ env.models }}" >> $GITHUB_OUTPUT
run_models_gpu:
name: Run all tests for the new model
# Triggered if it is a new model PR and the required label is added
if: ${{ needs.check_for_new_model.outputs.new_model != '' && contains(github.event.pull_request.labels.*.name, 'single-model-run-slow') }}
needs: check_for_new_model
name: Run all tests for the model
# Triggered only if `find_models_to_run` is triggered (i.e. the `run-slow` label is added), which gives the models to run
# (either a new model PR or via a commit message)
if: ${{ needs.find_models_to_run.outputs.models != '[]' }}
needs: find_models_to_run
strategy:
fail-fast: false
matrix:
folders: ["${{ needs.check_for_new_model.outputs.new_model }}"]
folders: ${{ fromJson(needs.find_models_to_run.outputs.models) }}
machine_type: [single-gpu, multi-gpu]
runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, ci]
container:
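
The workflow above captures only the last line printed by `utils/pr_slow_ci_models.py` (via `tail -n 1`), and both the `!= '[]'` guard and `fromJson(...)` in the matrix expect that line to be a JSON list of model folders. The following is a rough, purely illustrative sketch of that contract — it is not the actual script, and the `[run-slow]` commit-message convention shown here is an assumption:

```py
# Hypothetical sketch only: illustrates the shape of the last line that
# `tail -n 1 output.txt` captures and that `fromJson(...)` expands into the matrix.
import json
import re


def models_from_commit_message(commit_message: str) -> list:
    """Extract model folders from a commit message such as '[run-slow] llama, gemma'."""
    match = re.search(r"\[run[-_]slow\]\s*(.*)", commit_message)
    if match is None:
        return []
    names = [name.strip() for name in match.group(1).split(",") if name.strip()]
    return [f"models/{name}" for name in names]


if __name__ == "__main__":
    # The last printed line must be a JSON list, e.g. ["models/llama", "models/gemma"],
    # so the `run_models_gpu` job can compare it against '[]' and feed it to `fromJson`.
    print(json.dumps(models_from_commit_message("[run-slow] llama, gemma")))
```
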
1 change: 1 addition & 0 deletions docs/source/en/internal/generation_utils.md
@@ -362,3 +362,4 @@ A [`Constraint`] can be used to force the generation to include specific tokens
[[autodoc]] StaticCache
- update
- get_seq_length
- reorder_cache
65 changes: 38 additions & 27 deletions docs/source/en/llm_optims.md
@@ -65,13 +65,12 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
['The theory of special relativity states 1. The speed of light is constant in all inertial reference']
```

</hfoption>
<hfoption id="setup_cache">
Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. However, if the batch size or the maximum output length increases between calls, the cache will have to be reinitialized, triggering a new compilation.

> [!WARNING]
> The `_setup_cache` method is an internal and private method that is still under development. This means it may not be backward compatible and the API design may change in the future.
</hfoption>
<hfoption id="Static Cache">

The `_setup_cache` method doesn't support [`~GenerationMixin.generate`] yet, so this method is a bit more involved. You'll need to write your own function to decode the next token given the current token and position and cache position of previously generated tokens.
A [`StaticCache`] object can be passed to the model's forward pass under the `past_key_values` argument, enabling the use of this object as a static kv-cache. Using this strategy, you can write your own function to decode the next token given the current token, its position, and the cache position of previously generated tokens. You can also pass the [`StaticCache`] object to [`~GenerationMixin.generate`] and use it across calls, like you would with a dynamic cache.

```py
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
@@ -90,17 +89,22 @@ tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

def decode_one_tokens(model, cur_token, input_pos, cache_position):
def decode_one_tokens(model, cur_token, input_pos, cache_position, past_key_values):
logits = model(
cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True
cur_token,
position_ids=input_pos,
cache_position=cache_position,
past_key_values=past_key_values,
return_dict=False,
use_cache=True
)[0]
new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
return new_token
```

There are a few important things you must do to enable static kv-cache and torch.compile with the `_setup_cache` method:
There are a few important things you must do to enable static kv-cache and torch.compile with the [`StaticCache`] class:

1. Access the model's `_setup_cache` method and pass it the [`StaticCache`] class. This is a more flexible method because it allows you to configure parameters like the maximum batch size and sequence length.
1. Initialize the [`StaticCache`] instance before using the model for inference, where you can configure parameters such as the maximum batch size and sequence length.

2. Call torch.compile on the model to compile the forward pass with the static kv-cache.

@@ -109,31 +113,38 @@ There are a few important things you must do to enable static kv-cache and torch
```py
batch_size, seq_length = inputs["input_ids"].shape
with torch.no_grad():
model._setup_cache(StaticCache, 2, max_cache_len=4096)
cache_position = torch.arange(seq_length, device=torch_device)
generated_ids = torch.zeros(
batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
)
generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)

logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
generated_ids[:, seq_length] = next_token[:, 0]

decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
cache_position = torch.tensor([seq_length + 1], device=torch_device)
for _ in range(1, NUM_TOKENS_TO_GENERATE):
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
next_token = decode_one_tokens(model, next_token.clone(), None, cache_position)
generated_ids[:, cache_position] = next_token.int()
cache_position += 1
past_key_values = StaticCache(
config=model.config, max_batch_size=2, max_cache_len=4096, device=torch_device, dtype=model.dtype
)
cache_position = torch.arange(seq_length, device=torch_device)
generated_ids = torch.zeros(
batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
)
generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)

logits = model(
**inputs, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True
)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
generated_ids[:, seq_length] = next_token[:, 0]

decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
cache_position = torch.tensor([seq_length + 1], device=torch_device)
for _ in range(1, NUM_TOKENS_TO_GENERATE):
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
next_token = decode_one_tokens(model, next_token.clone(), None, cache_position, past_key_values)
generated_ids[:, cache_position] = next_token.int()
cache_position += 1

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
text
['Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.',
'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p']
```

> [!TIP]
> If you want to reuse the [`StaticCache`] object on a new prompt, be sure to reset its contents with the `.reset()` method.
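
As a rough illustration of reusing the cache across calls, here is a minimal sketch only — it assumes the `model`, `tokenizer`, and `torch_device` defined earlier in this section, and that your installed version accepts `past_key_values` in [`~GenerationMixin.generate`]:

```py
# Minimal sketch (assumes `model`, `tokenizer`, and `torch_device` from the setup above).
prompt = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)

# Reusable static cache, sized once for the largest generation you expect.
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=1024, device=torch_device, dtype=model.dtype
)

out = model.generate(**prompt, past_key_values=past_key_values, max_new_tokens=20)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

# Reset the cache contents before reusing the same object on a new prompt.
past_key_values.reset()
new_prompt = tokenizer("My favorite all time favorite condiment is ", return_tensors="pt").to(model.device)
out = model.generate(**new_prompt, past_key_values=past_key_values, max_new_tokens=20)
```

Because the cache is pre-allocated up to `max_cache_len`, the second call reuses the same memory rather than allocating a new cache.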
</hfoption>
</hfoptions>

19 changes: 11 additions & 8 deletions docs/source/en/llm_tutorial.md
@@ -247,10 +247,11 @@ While the autoregressive generation process is relatively straightforward, makin

### Advanced generate usage

1. [Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output;
2. [Guide](chat_templating) on the prompt template for chat LLMs;
3. [Guide](tasks/prompting) on to get the most of prompt design;
4. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples!
1. Guide on how to [control different generation methods](generation_strategies), how to set up the generation configuration file, and how to stream the output;
2. [Accelerating text generation](llm_optims);
3. [Prompt templates for chat LLMs](chat_templating);
4. [Prompt design guide](tasks/prompting);
5. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples!

### LLM leaderboards

@@ -259,10 +260,12 @@ While the autoregressive generation process is relatively straightforward, makin

### Latency, throughput and memory utilization

1. [Guide](llm_tutorial_optimization) on how to optimize LLMs for speed and memory;
2. [Guide](main_classes/quantization) on quantization such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements.
1. Guide on how to [optimize LLMs for speed and memory](llm_tutorial_optimization);
2. Guide on [quantization](main_classes/quantization) such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements.

### Related libraries

1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
2. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.
1. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.
2. [`outlines`](https://github.com/outlines-dev/outlines), a library where you can constrain text generation (e.g. to generate JSON files);
3. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
4. [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui), a UI for text generation.
23 changes: 14 additions & 9 deletions docs/source/en/tasks/object_detection.md
@@ -41,11 +41,11 @@ To see all architectures and checkpoints compatible with this task, we recommend
Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q datasets transformers evaluate timm albumentations
pip install -q datasets transformers accelerate evaluate albumentations
```

You'll use 🤗 Datasets to load a dataset from the Hugging Face Hub, 🤗 Transformers to train your model,
and `albumentations` to augment the data. `timm` is currently required to load a convolutional backbone for the DETR model.
and `albumentations` to augment the data.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the Hub.
When prompted, enter your token to log in:
@@ -342,6 +342,7 @@ and `id2label` maps that you created earlier from the dataset's metadata. Additi
... id2label=id2label,
... label2id=label2id,
... ignore_mismatched_sizes=True,
... revision="no_timm", # DETR models can be loaded without timm
... )
```
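
For context, a minimal sketch of what the full loading call could look like with the `no_timm` revision — assuming the `facebook/detr-resnet-50` checkpoint used in this guide and the `label2id`/`id2label` maps built from the dataset earlier:

```py
>>> from transformers import AutoModelForObjectDetection

>>> # Sketch: the call above shown in full (checkpoint name assumed from earlier in this guide)
>>> model = AutoModelForObjectDetection.from_pretrained(
...     "facebook/detr-resnet-50",
...     id2label=id2label,
...     label2id=label2id,
...     ignore_mismatched_sizes=True,
...     revision="no_timm",  # loads the transformers-native backbone, so timm is not needed
... )
```
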

@@ -357,7 +358,7 @@ Face to upload your model).
>>> training_args = TrainingArguments(
... output_dir="detr-resnet-50_finetuned_cppe5",
... per_device_train_batch_size=8,
... num_train_epochs=10,
... num_train_epochs=100,
... fp16=True,
... save_steps=200,
... logging_steps=50,
@@ -487,10 +488,10 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
... return {"pixel_values": pixel_values, "labels": target}


>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")
>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")

>>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"])
>>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno)
>>> test_ds_coco_format = CocoDetection(path_output_cppe5, image_processor, path_anno)
```

Finally, load the metrics and run the evaluation.
@@ -505,10 +506,13 @@ Finally, load the metrics and run the evaluation.
... test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn
... )

>>> device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
>>> model.to(device)

>>> with torch.no_grad():
... for idx, batch in enumerate(tqdm(val_dataloader)):
... pixel_values = batch["pixel_values"]
... pixel_mask = batch["pixel_mask"]
... pixel_values = batch["pixel_values"].to(device)
... pixel_mask = batch["pixel_mask"].to(device)

... labels = [
... {k: v for k, v in t.items()} for t in batch["labels"]
Expand All @@ -518,8 +522,9 @@ Finally, load the metrics and run the evaluation.
... outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask)

... orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0)
... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax)

... # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax)
... results = image_processor.post_process_object_detection(outputs, threshold=0, target_sizes=orig_target_sizes)
...
... module.add(prediction=results, reference=labels)
... del batch

10 changes: 5 additions & 5 deletions docs/source/en/tasks/semantic_segmentation.md
@@ -60,15 +60,15 @@ image
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg" alt="Segmentation Input"/>
</div>

We will use [nvidia/segformer-b1-finetuned-cityscapes-1024-1024](https://huggingface.co/nvidia/segformer-b1-finetuned-cityscapes-1024-1024).

```python
semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
results = semantic_segmentation(image)
results
```

The segmentation pipeline output includes a mask for every predicted class.
```bash
[{'score': None,
'label': 'road',
@@ -111,11 +111,11 @@ results[-1]["mask"]
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/semantic_segmentation_output.png" alt="Semantic Segmentation Output"/>
</div>

In instance segmentation, the goal is not to classify every pixel, but to predict a mask for **every instance of an object** in a given image. It works very similarly to object detection, but instead of a bounding box for every instance, there is a segmentation mask. We will use [facebook/mask2former-swin-large-cityscapes-instance](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-instance) for this.

```python
instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance")
results = instance_segmentation(Image.open(image))
results = instance_segmentation(image)
results
```

@@ -148,7 +148,7 @@ Panoptic segmentation combines semantic segmentation and instance segmentation,

```python
panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results = panoptic_segmentation(Image.open(image))
results = panoptic_segmentation(image)
results
```
As you can see below, we have more classes. We will later illustrate to see that every pixel is classified into one of the classes.
2 changes: 1 addition & 1 deletion docs/source/es/converting_tensorflow_models.md
@@ -89,7 +89,7 @@ Aquí hay un ejemplo del proceso para convertir un modelo OpenAI GPT-2 pre-entre
```bash
export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/openai-community/gpt2/pretrained/weights

transformers-cli convert --model_type openai-community/gpt2 \
transformers-cli convert --model_type gpt2 \
--tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config OPENAI_GPT2_CONFIG] \