Commit

Merge branch 'main' into change-ci

ArthurZucker authored May 2, 2024
2 parents 727e9b6 + 5cf3e6b commit 30f712d
Showing 122 changed files with 1,977 additions and 2,599 deletions.
@@ -4,6 +4,11 @@ on:
pull_request:
paths:
- "src/transformers/models/*/modeling_*.py"
- "tests/models/*/test_*.py"

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

env:
HF_HOME: /mnt/cache
@@ -20,31 +25,46 @@ env:
CUDA_VISIBLE_DEVICES: 0,1

jobs:
check_for_new_model:
find_models_to_run:
runs-on: ubuntu-22.04
name: Check if a PR is a new model PR
name: Find models to run slow tests
# Triggered only if the required label `run-slow` is added
if: ${{ contains(github.event.pull_request.labels.*.name, 'run-slow') }}
outputs:
new_model: ${{ steps.check_new_model.outputs.new_model }}
models: ${{ steps.models_to_run.outputs.models }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: "0"
ref: ${{ github.event.pull_request.head.sha }}

- name: Get commit message
run: |
echo "commit_message=$(git show -s --format=%s)" >> $GITHUB_ENV
- name: Check if there is a new model
id: check_new_model
- name: Get models to run slow tests
run: |
echo "${{ env.commit_message }}"
python -m pip install GitPython
echo "new_model=$(python utils/check_if_new_model_added.py | tail -n 1)" >> $GITHUB_OUTPUT
python utils/pr_slow_ci_models.py --commit_message "${{ env.commit_message }}" | tee output.txt
echo "models=$(tail -n 1 output.txt)" >> $GITHUB_ENV
- name: Models to run slow tests
id: models_to_run
run: |
echo "${{ env.models }}"
echo "models=${{ env.models }}" >> $GITHUB_OUTPUT
run_models_gpu:
name: Run all tests for the new model
# Triggered if it is a new model PR and the required label is added
if: ${{ needs.check_for_new_model.outputs.new_model != '' && contains(github.event.pull_request.labels.*.name, 'single-model-run-slow') }}
needs: check_for_new_model
name: Run all tests for the model
# Triggered only if `find_models_to_run` is triggered (i.e. the `run-slow` label is added), which gives the models to run
# (either a new model PR or via a commit message)
if: ${{ needs.find_models_to_run.outputs.models != '[]' }}
needs: find_models_to_run
strategy:
fail-fast: false
matrix:
folders: ["${{ needs.check_for_new_model.outputs.new_model }}"]
folders: ${{ fromJson(needs.find_models_to_run.outputs.models) }}
machine_type: [single-gpu, multi-gpu]
runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, ci]
container:
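
The workflow above captures only the last line printed by `utils/pr_slow_ci_models.py` (via `tail -n 1`), and both the `!= '[]'` guard and `fromJson(...)` in the matrix expect that line to be a JSON list of model folders. The following is a rough, purely illustrative sketch of that contract — it is not the actual script, and the `[run-slow]` commit-message convention shown here is an assumption:

```py
# Hypothetical sketch only: illustrates the shape of the last line that
# `tail -n 1 output.txt` captures and that `fromJson(...)` expands into the matrix.
import json
import re


def models_from_commit_message(commit_message: str) -> list:
    """Extract model folders from a commit message such as '[run-slow] llama, gemma'."""
    match = re.search(r"\[run[-_]slow\]\s*(.*)", commit_message)
    if match is None:
        return []
    names = [name.strip() for name in match.group(1).split(",") if name.strip()]
    return [f"models/{name}" for name in names]


if __name__ == "__main__":
    # The last printed line must be a JSON list, e.g. ["models/llama", "models/gemma"],
    # so the `run_models_gpu` job can compare it against '[]' and feed it to `fromJson`.
    print(json.dumps(models_from_commit_message("[run-slow] llama, gemma")))
```
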
1 change: 1 addition & 0 deletions docs/source/en/internal/generation_utils.md
@@ -362,3 +362,4 @@ A [`Constraint`] can be used to force the generation to include specific tokens
[[autodoc]] StaticCache
- update
- get_seq_length
- reorder_cache
65 changes: 38 additions & 27 deletions docs/source/en/llm_optims.md
@@ -65,13 +65,12 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
['The theory of special relativity states 1. The speed of light is constant in all inertial reference']
```

</hfoption>
<hfoption id="setup_cache">
Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. However, if the batch size or the maximum output length increases between calls, the cache will have to be reinitialized, triggering a new compilation.

> [!WARNING]
> The `_setup_cache` method is an internal and private method that is still under development. This means it may not be backward compatible and the API design may change in the future.
</hfoption>
<hfoption id="Static Cache">

The `_setup_cache` method doesn't support [`~GenerationMixin.generate`] yet, so this method is a bit more involved. You'll need to write your own function to decode the next token given the current token and position and cache position of previously generated tokens.
A [`StaticCache`] object can be passed to the model's forward pass under the `past_key_values` argument, enabling the use of this object as a static kv-cache. Using this strategy, you can write your own function to decode the next token given the current token, its position, and the cache position of previously generated tokens. You can also pass the [`StaticCache`] object to [`~GenerationMixin.generate`] and use it across calls, like you would with a dynamic cache.

```py
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
@@ -90,17 +89,22 @@ tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

def decode_one_tokens(model, cur_token, input_pos, cache_position):
def decode_one_tokens(model, cur_token, input_pos, cache_position, past_key_values):
logits = model(
cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True
cur_token,
position_ids=input_pos,
cache_position=cache_position,
past_key_values=past_key_values,
return_dict=False,
use_cache=True
)[0]
new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
return new_token
```

There are a few important things you must do to enable static kv-cache and torch.compile with the `_setup_cache` method:
There are a few important things you must do to enable static kv-cache and torch.compile with the [`StaticCache`] class:

1. Access the model's `_setup_cache` method and pass it the [`StaticCache`] class. This is a more flexible method because it allows you to configure parameters like the maximum batch size and sequence length.
1. Initialize the [`StaticCache`] instance before using the model for inference, where you can configure parameters such as the maximum batch size and sequence length.

2. Call torch.compile on the model to compile the forward pass with the static kv-cache.

@@ -109,31 +113,38 @@ There are a few important things you must do to enable static kv-cache and torch
```py
batch_size, seq_length = inputs["input_ids"].shape
with torch.no_grad():
model._setup_cache(StaticCache, 2, max_cache_len=4096)
cache_position = torch.arange(seq_length, device=torch_device)
generated_ids = torch.zeros(
batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
)
generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)

logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
generated_ids[:, seq_length] = next_token[:, 0]

decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
cache_position = torch.tensor([seq_length + 1], device=torch_device)
for _ in range(1, NUM_TOKENS_TO_GENERATE):
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
next_token = decode_one_tokens(model, next_token.clone(), None, cache_position)
generated_ids[:, cache_position] = next_token.int()
cache_position += 1
past_key_values = StaticCache(
config=model.config, max_batch_size=2, max_cache_len=4096, device=torch_device, dtype=model.dtype
)
cache_position = torch.arange(seq_length, device=torch_device)
generated_ids = torch.zeros(
batch_size, seq_length + NUM_TOKENS_TO_GENERATE + 1, dtype=torch.int, device=torch_device
)
generated_ids[:, cache_position] = inputs["input_ids"].to(torch_device).to(torch.int)

logits = model(
**inputs, cache_position=cache_position, past_key_values=past_key_values, return_dict=False, use_cache=True
)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
generated_ids[:, seq_length] = next_token[:, 0]

decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
cache_position = torch.tensor([seq_length + 1], device=torch_device)
for _ in range(1, NUM_TOKENS_TO_GENERATE):
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
next_token = decode_one_tokens(model, next_token.clone(), None, cache_position, past_key_values)
generated_ids[:, cache_position] = next_token.int()
cache_position += 1

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
text
['Simply put, the theory of relativity states that 1) the speed of light is constant, 2) the speed of light is the same for all observers, and 3) the laws of physics are the same for all observers.',
'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p']
```

> [!TIP]
> If you want to reuse the [`StaticCache`] object on a new prompt, be sure to reset its contents with the `.reset()` method.
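
As a rough illustration of reusing the cache across calls, here is a minimal sketch only — it assumes the `model`, `tokenizer`, and `torch_device` defined earlier in this section, and that your installed version accepts `past_key_values` in [`~GenerationMixin.generate`]:

```py
# Minimal sketch (assumes `model`, `tokenizer`, and `torch_device` from the setup above).
prompt = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)

# Reusable static cache, sized once for the largest generation you expect.
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=1024, device=torch_device, dtype=model.dtype
)

out = model.generate(**prompt, past_key_values=past_key_values, max_new_tokens=20)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

# Reset the cache contents before reusing the same object on a new prompt.
past_key_values.reset()
new_prompt = tokenizer("My favorite all time favorite condiment is ", return_tensors="pt").to(model.device)
out = model.generate(**new_prompt, past_key_values=past_key_values, max_new_tokens=20)
```

Because the cache is pre-allocated up to `max_cache_len`, the second call reuses the same memory rather than allocating a new cache.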
</hfoption>
</hfoptions>

19 changes: 11 additions & 8 deletions docs/source/en/llm_tutorial.md
@@ -247,10 +247,11 @@ While the autoregressive generation process is relatively straightforward, makin

### Advanced generate usage

1. [Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output;
2. [Guide](chat_templating) on the prompt template for chat LLMs;
3. [Guide](tasks/prompting) on to get the most of prompt design;
4. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples!
1. Guide on how to [control different generation methods](generation_strategies), how to set up the generation configuration file, and how to stream the output;
2. [Accelerating text generation](llm_optims);
3. [Prompt templates for chat LLMs](chat_templating);
4. [Prompt design guide](tasks/prompting);
5. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples!

### LLM leaderboards

@@ -259,10 +260,12 @@ While the autoregressive generation process is relatively straightforward, makin

### Latency, throughput and memory utilization

1. [Guide](llm_tutorial_optimization) on how to optimize LLMs for speed and memory;
2. [Guide](main_classes/quantization) on quantization such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements.
1. Guide on how to [optimize LLMs for speed and memory](llm_tutorial_optimization);
2. Guide on [quantization](main_classes/quantization) such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements.

### Related libraries

1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
2. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.
1. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices.
2. [`outlines`](https://github.com/outlines-dev/outlines), a library where you can constrain text generation (e.g. to generate JSON files);
3. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs;
4. [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui), a UI for text generation.
23 changes: 14 additions & 9 deletions docs/source/en/tasks/object_detection.md
@@ -41,11 +41,11 @@ To see all architectures and checkpoints compatible with this task, we recommend
Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q datasets transformers evaluate timm albumentations
pip install -q datasets transformers accelerate evaluate albumentations
```

You'll use 🤗 Datasets to load a dataset from the Hugging Face Hub, 🤗 Transformers to train your model,
and `albumentations` to augment the data. `timm` is currently required to load a convolutional backbone for the DETR model.
and `albumentations` to augment the data.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the Hub.
When prompted, enter your token to log in:
@@ -342,6 +342,7 @@ and `id2label` maps that you created earlier from the dataset's metadata. Additi
... id2label=id2label,
... label2id=label2id,
... ignore_mismatched_sizes=True,
... revision="no_timm", # DETR models can be loaded without timm
... )
```
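
For context, a minimal sketch of what the full loading call could look like with the `no_timm` revision — assuming the `facebook/detr-resnet-50` checkpoint used in this guide and the `label2id`/`id2label` maps built from the dataset earlier:

```py
>>> from transformers import AutoModelForObjectDetection

>>> # Sketch: the call above shown in full (checkpoint name assumed from earlier in this guide)
>>> model = AutoModelForObjectDetection.from_pretrained(
...     "facebook/detr-resnet-50",
...     id2label=id2label,
...     label2id=label2id,
...     ignore_mismatched_sizes=True,
...     revision="no_timm",  # loads the transformers-native backbone, so timm is not needed
... )
```
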

@@ -357,7 +358,7 @@ Face to upload your model).
>>> training_args = TrainingArguments(
... output_dir="detr-resnet-50_finetuned_cppe5",
... per_device_train_batch_size=8,
... num_train_epochs=10,
... num_train_epochs=100,
... fp16=True,
... save_steps=200,
... logging_steps=50,
@@ -487,10 +488,10 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
... return {"pixel_values": pixel_values, "labels": target}


>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")
>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")

>>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"])
>>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno)
>>> test_ds_coco_format = CocoDetection(path_output_cppe5, image_processor, path_anno)
```

Finally, load the metrics and run the evaluation.
@@ -505,10 +506,13 @@ Finally, load the metrics and run the evaluation.
... test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn
... )

>>> device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
>>> model.to(device)

>>> with torch.no_grad():
... for idx, batch in enumerate(tqdm(val_dataloader)):
... pixel_values = batch["pixel_values"]
... pixel_mask = batch["pixel_mask"]
... pixel_values = batch["pixel_values"].to(device)
... pixel_mask = batch["pixel_mask"].to(device)

... labels = [
... {k: v for k, v in t.items()} for t in batch["labels"]
Expand All @@ -518,8 +522,9 @@ Finally, load the metrics and run the evaluation.
... outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask)

... orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0)
... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax)

... # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax)
... results = image_processor.post_process_object_detection(outputs, threshold=0, target_sizes=orig_target_sizes)
...
... module.add(prediction=results, reference=labels)
... del batch

10 changes: 5 additions & 5 deletions docs/source/en/tasks/semantic_segmentation.md
@@ -60,15 +60,15 @@ image
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg" alt="Segmentation Input"/>
</div>

We will use [nvidia/segformer-b1-finetuned-cityscapes-1024-1024](https://huggingface.co/nvidia/segformer-b1-finetuned-cityscapes-1024-1024).

```python
semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
results = semantic_segmentation(image)
results
```

The segmentation pipeline output includes a mask for every predicted class.
```bash
[{'score': None,
'label': 'road',
@@ -111,11 +111,11 @@ results[-1]["mask"]
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/semantic_segmentation_output.png" alt="Semantic Segmentation Output"/>
</div>

In instance segmentation, the goal is not to classify every pixel, but to predict a mask for **every instance of an object** in a given image. It works very similarly to object detection, but instead of a bounding box for every instance, there is a segmentation mask. We will use [facebook/mask2former-swin-large-cityscapes-instance](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-instance) for this.

```python
instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance")
results = instance_segmentation(Image.open(image))
results = instance_segmentation(image)
results
```

@@ -148,7 +148,7 @@ Panoptic segmentation combines semantic segmentation and instance segmentation,

```python
panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results = panoptic_segmentation(Image.open(image))
results = panoptic_segmentation(image)
results
```
As you can see below, we have more classes. We will later illustrate to see that every pixel is classified into one of the classes.
2 changes: 1 addition & 1 deletion docs/source/es/converting_tensorflow_models.md
@@ -89,7 +89,7 @@ Aquí hay un ejemplo del proceso para convertir un modelo OpenAI GPT-2 pre-entre
```bash
export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/openai-community/gpt2/pretrained/weights

transformers-cli convert --model_type openai-community/gpt2 \
transformers-cli convert --model_type gpt2 \
--tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \
[--config OPENAI_GPT2_CONFIG] \