Merge branch 'main' into secure-amd-ci
ivarflakstad authored Dec 5, 2024
2 parents 32b3718 + e27465c commit 2a93838
Showing 124 changed files with 6,973 additions and 1,895 deletions.
4 changes: 2 additions & 2 deletions docs/source/ar/_toctree.yml
@@ -131,8 +131,8 @@
title: التصدير إلى TorchScript
- local: benchmarks
title: المعايير
# - local: notebooks
# title: دفاتر الملاحظات مع الأمثلة
- local: notebooks
title: دفاتر الملاحظات مع الأمثلة
# - local: community
# title: موارد المجتمع
- local: troubleshooting
141 changes: 141 additions & 0 deletions docs/source/ar/notebooks.md

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -657,6 +657,8 @@
title: GLPN
- local: model_doc/hiera
title: Hiera
- local: model_doc/ijepa
title: I-JEPA
- local: model_doc/imagegpt
title: ImageGPT
- local: model_doc/levit
2 changes: 1 addition & 1 deletion docs/source/en/agents.md
@@ -225,7 +225,7 @@ You have access to the following tools:
To solve the task, you must plan forward to proceed in a series of steps, in a cycle of 'Thought:', 'Code:', and 'Observation:' sequences.
At each step, in the 'Thought:' sequence, you should first explain your reasoning towards solving the task, then the tools that you want to use.
Then in the 'Code:' sequence, you shold write the code in simple Python. The code sequence must end with '/End code' sequence.
Then in the 'Code:' sequence, you should write the code in simple Python. The code sequence must end with '/End code' sequence.
During each intermediate step, you can use 'print()' to save whatever important information you will then need.
These print outputs will then be available in the 'Observation:' field, for using this information as input for the next step.
2 changes: 1 addition & 1 deletion docs/source/en/agents_advanced.md
@@ -211,7 +211,7 @@ agent.run("How many more blocks (also denoted as layers) are in BERT base encode

## Display your agent run in a cool Gradio interface

You can leverage `gradio.Chatbot`to display your agent's thoughts using `stream_to_gradio`, here is an example:
You can leverage `gradio.Chatbot` to display your agent's thoughts using `stream_to_gradio`, here is an example:

```py
import gradio as gr
9 changes: 6 additions & 3 deletions docs/source/en/autoclass_tutorial.md
@@ -138,20 +138,23 @@ Load a processor with [`AutoProcessor.from_pretrained`]:

<frameworkcontent>
<pt>
The `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]:
The `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`].

> [!WARNING]
> By default, the weights are loaded in full precision (torch.float32) regardless of the data type they were stored in, such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in the model's `config.json`, which is usually the most memory-efficient option.
```py
>>> from transformers import AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype="auto")
```

Easily reuse the same checkpoint to load an architecture for a different task:

```py
>>> from transformers import AutoModelForTokenClassification

>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased")
>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype="auto")
```

<Tip warning={true}>
2 changes: 2 additions & 0 deletions docs/source/en/generation_strategies.md
@@ -456,6 +456,8 @@ just like in multinomial sampling. However, in assisted decoding, reducing the t
['Alice and Bob, a couple of friends of mine, who are both in the same office as']
```

We recommend installing the `scikit-learn` library to enhance the candidate generation strategy and achieve an additional speedup.

#### Universal Assisted Decoding

Universal Assisted Decoding (UAD) adds support for main and assistant models with different tokenizers.
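As a rough sketch of what this looks like in practice (the checkpoint pairing below is illustrative, and it assumes `generate` accepts `tokenizer` and `assistant_tokenizer` arguments for this purpose):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"             # main model (illustrative choice)
assistant_checkpoint = "double7/vicuna-68m"  # assistant with a *different* tokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, torch_dtype="auto")

# Passing both tokenizers lets UAD re-encode the assistant's draft tokens for the main model.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```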
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -168,6 +168,7 @@ Flax), PyTorch, and/or TensorFlow.
| [Hiera](model_doc/hiera) ||||
| [Hubert](model_doc/hubert) ||||
| [I-BERT](model_doc/ibert) ||||
| [I-JEPA](model_doc/ijepa) ||||
| [IDEFICS](model_doc/idefics) ||||
| [Idefics2](model_doc/idefics2) ||||
| [Idefics3](model_doc/idefics3) ||||
6 changes: 6 additions & 0 deletions docs/source/en/internal/generation_utils.md
@@ -436,3 +436,9 @@ A [`Constraint`] can be used to force the generation to include specific tokens

[[autodoc]] SynthIDTextWatermarkDetector
- __call__

## Compile Utils

[[autodoc]] CompileConfig
- __call__
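For orientation, here is a hedged sketch of how a `CompileConfig` might be wired into generation with a static cache. The `fullgraph` field and the `compile_config` argument to `generate` are assumptions based on the config's `torch.compile`-style options, not confirmed by this diff:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, CompileConfig

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")

# Assumed: CompileConfig mirrors torch.compile arguments (fullgraph, mode, backend, ...).
compile_config = CompileConfig(fullgraph=True)

inputs = tokenizer("The theory of special relativity states ", return_tensors="pt").to(model.device)
# Assumed: the config is consumed by generate() together with a static cache.
outputs = model.generate(**inputs, cache_implementation="static", compile_config=compile_config, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```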

3 changes: 2 additions & 1 deletion docs/source/en/kv_cache.md
@@ -180,7 +180,7 @@ Fun fact: The shortest war in history was between Britain and Zanzibar on August

<Tip warning={true}>

Cache offloading requires a GPU and can be slower than dynamic KV cache. Use it if you are getting CUDA out of memory errors.
Cache offloading requires a CUDA GPU and can be slower than dynamic KV cache. Use it if you are getting CUDA out of memory errors.

</Tip>

@@ -261,6 +261,7 @@ This will use the [`~OffloadedStaticCache`] implementation instead.
>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0]
"Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
```
Cache offloading requires a CUDA GPU.
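The call that produces output like the above is not visible in this hunk; a minimal sketch of what it might look like (checkpoint and prompt are assumptions) is:

```py
# Minimal sketch (assumed checkpoint/prompt): offload a *static* KV cache to CPU
# by selecting the "offloaded_static" cache implementation. Requires a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "microsoft/Phi-3-mini-4k-instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="cuda")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="offloaded_static")
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```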


### Sliding Window Cache
35 changes: 20 additions & 15 deletions docs/source/en/llm_optims.md
@@ -57,13 +57,13 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")

model.generation_config.cache_implementation = "static"

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
input_text = "The theory of special relativity states "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)

outputs = model.generate(**input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@@ -89,11 +89,11 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
input_text = "The theory of special relativity states "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)
prompt_length = input_ids.input_ids.shape[1]
model.generation_config.max_new_tokens = 16

@@ -126,14 +126,15 @@ If you want to go further down a level, the [`StaticCache`] object can also be p
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
from transformers.testing_utils import CaptureLogger
import torch
from accelerate.test_utils.testing import get_backend

prompts = [
"Simply put, the theory of relativity states that ",
"My favorite all time favorite condiment is ketchup.",
]

NUM_TOKENS_TO_GENERATE = 40
torch_device = "cuda"
torch_device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
@@ -201,11 +202,11 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")

model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True)
input_text = "The theory of special relativity states "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device.type)

outputs = model.generate(**input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@@ -241,13 +242,14 @@ Enable speculative decoding by loading an assistant model and passing it to the
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from accelerate.test_utils.testing import get_backend

device = "cuda" if torch.cuda.is_available() else "cpu"
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@@ -262,13 +264,14 @@ For speculative sampling decoding, add the `do_sample` and `temperature` paramet
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from accelerate.test_utils.testing import get_backend

device = "cuda" if torch.cuda.is_available() else "cpu"
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@@ -290,13 +293,14 @@ To enable prompt lookup decoding, specify the number of tokens that should be ov
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from accelerate.test_utils.testing import get_backend

device = "cuda" if torch.cuda.is_available() else "cpu"
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@@ -311,13 +315,14 @@ For prompt lookup decoding with sampling, add the `do_sample` and `temperature`
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from accelerate.test_utils.testing import get_backend

device = "cuda" if torch.cuda.is_available() else "cpu"
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["The second law of thermodynamics states that energy cannot be created nor destroyed. It's not a"]
78 changes: 78 additions & 0 deletions docs/source/en/model_doc/ijepa.md
@@ -0,0 +1,78 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# I-JEPA

## Overview

The I-JEPA model was proposed in [Image-based Joint-Embedding Predictive Architecture](https://arxiv.org/pdf/2301.08243.pdf) by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas.
I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

The abstract from the paper is the following:

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

This model was contributed by [jmtzt](https://huggingface.co/jmtzt).
The original code can be found [here](https://github.com/facebookresearch/ijepa).

## How to use

Here is how to use this model for image feature extraction:

```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity

from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

@torch.no_grad()
def infer(image):
    inputs = processor(image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)


embed_1 = infer(image_1)
embed_2 = infer(image_2)

similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
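To make the `IJepaForImageClassification` head documented below concrete, here is a minimal, hedged sketch of image classification. It reuses the base checkpoint above, so the classification head is randomly initialized and would need fine-tuning before the predicted label means anything:

```python
import requests
import torch
from PIL import Image

from transformers import AutoProcessor, IJepaForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model_id = "jmtzt/ijepa_vith14_1k"  # base checkpoint; a fine-tuned head is assumed for real use
processor = AutoProcessor.from_pretrained(model_id)
model = IJepaForImageClassification.from_pretrained(model_id, num_labels=2)  # head is freshly initialized

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(-1).item())
```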

## IJepaConfig

[[autodoc]] IJepaConfig

## IJepaModel

[[autodoc]] IJepaModel
- forward

## IJepaForImageClassification

[[autodoc]] IJepaForImageClassification
- forward
2 changes: 1 addition & 1 deletion docs/source/en/perf_infer_cpu.md
@@ -41,7 +41,7 @@ Enable BetterTransformer with the [`PreTrainedModel.to_bettertransformer`] metho
```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", torch_dtype="auto")
```
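The hunk above only loads the model; as a hedged follow-up, the conversion call itself might look like the sketch below (it requires the `optimum` package, and the smaller checkpoint is an assumption chosen so the example runs comfortably on CPU):

```py
from transformers import AutoModelForCausalLM

# Assumed smaller checkpoint for a CPU-only illustration.
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2", torch_dtype="auto")
# Swap supported modules for their BetterTransformer fastpath equivalents.
model = model.to_bettertransformer()
```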

## TorchScript