Generate: end-to-end compilation #30788

gante · 2024-05-13T17:02:59Z

What does this PR do?

This PR introduces an MVP for end-to-end generate compilation -- the whole generate function can be compiled into a single graph. From this state, we can start testing and improving many aspects of our generate code wrt compilation, as well as unlock features that rely on generate being a single graph.

⚠️ generate compilation in this PR has MANY restrictions, including:

Compilation is veeery slow 🐌;
torch.multinomial(probs, num_samples=1) can't be compiled into a CUDA graph, so do_sample=True is not compatible without a CUDA graph break;
stopping when the model emits an EOS token is not compatible.
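A common way to make EOS handling fit a fixed-shape graph is to drop the data-dependent early exit. The sketch below is not this PR's code. It always runs `max_new_tokens` steps, tracks unfinished rows with a mask, and overwrites tokens generated after EOS with a pad token. `fixed_length_greedy` and `toy_step` are hypothetical names, and `toy_step` stands in for a model forward pass. The trade-off is that finished batches still pay for every remaining step, which is effectively the regime the benchmark below forces with `eos_token_id=None` and `min_new_tokens`.

```python
import torch


def fixed_length_greedy(step_fn, input_ids, max_new_tokens, eos_token_id, pad_token_id):
    """Greedy decoding that always runs exactly `max_new_tokens` steps.

    There is no `break` when every row has produced EOS; rows that already
    finished keep emitting `pad_token_id` instead, so the loop's trip count
    does not depend on the generated values.
    """
    batch_size = input_ids.shape[0]
    unfinished = torch.ones(batch_size, dtype=torch.long, device=input_ids.device)
    for _ in range(max_new_tokens):  # fixed trip count, no early exit
        logits = step_fn(input_ids)  # (batch, vocab) next-token scores
        next_tokens = torch.argmax(logits, dim=-1)
        # finished rows get padding, unfinished rows keep their prediction
        next_tokens = next_tokens * unfinished + pad_token_id * (1 - unfinished)
        input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
        # a row becomes finished once it emits EOS and stays finished
        unfinished = unfinished * (next_tokens != eos_token_id).long()
    return input_ids


def toy_step(input_ids, vocab_size=5):
    """Hypothetical stand-in for a model forward: predicts (last token + 1) % vocab_size."""
    next_ids = (input_ids[:, -1] + 1) % vocab_size
    return torch.nn.functional.one_hot(next_ids, vocab_size).float()


out = fixed_length_greedy(toy_step, torch.tensor([[1], [2]]), max_new_tokens=4, eos_token_id=3, pad_token_id=0)
print(out.tolist())  # [[1, 2, 3, 0, 0], [2, 3, 0, 0, 0]]
```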

Tests

The following slow tests were passing (as of 4241ab3):

RUN_SLOW=1 py.test -vv tests/models/ -k test_generate_compile_fullgraph
RUN_SLOW=1 py.test -vv tests/utils/test_cache_utils.py
RUN_SLOW=1 py.test -vv tests/models/llama/test_modeling_llama.py

Performance

TL;DR:

No big speed wins vs. compiling only forward (~5% faster). The main benefit is the ability to export a single graph.
Compilation takes a LOT of time. On both test devices, the process got killed when max_new_tokens=256.

Setup:

A100 80GB / RTX 3090 24GB
compiling forward but calling generate (i.e. there is some overhead from calling the uncompiled generate)
model: google/gemma-2b

code

import copy
import os
import torch
from torch.utils import benchmark

from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Benchmarking settings
BSZ = [1, 4]
NEW_TOK = [16, 256]
# BSZ = [1]
# NEW_TOK = [16]
N_ITER = 20
MODEL_ID = "google/gemma-2b"
ATTN_IMPL = "sdpa"

# Other constants
FRANCE_ARTICLE = (  # @noqa
    """<s>Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." Robin\'s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps. All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites. The publications said that they watched the video, which was found by a source close to the investigation. \"One can hear cries of 'My God' in several languages,\" Paris Match reported. "Metallic banging can also be heard more than three times, perhaps of the pilot trying to open the cockpit door with a heavy object.  Towards the end, after a heavy shake, stronger than the others, the screaming intensifies. Then nothing." "It is a very disturbing scene," said Julian Reichelt, editor-in-chief of Bild online. An official with France's accident investigation agency, the BEA, said the agency is not aware of any such video. Lt. Col. Jean-Marc Menichini, a French Gendarmerie spokesman in charge of communications on rescue efforts around the Germanwings crash site, told CNN that the reports were "completely wrong" and "unwarranted." Cell phones have been collected at the site, he said, but that they "hadn\'t been exploited yet." 
Menichini said he believed the cell phones would need to be sent to the Criminal Research Institute in Rosny sous-Bois, near Paris, in order to be analyzed by specialized technicians working hand-in-hand with investigators. But none of the cell phones found so far have been sent to the institute, Menichini said. Asked whether staff involved in the search could have leaked a memory card to the media, Menichini answered with a categorical "no." Reichelt told "Erin Burnett: Outfront" that he had watched the video and stood by the report, saying Bild and Paris Match are "very confident" that the clip is real. He noted that investigators only revealed they\'d recovered cell phones from the crash site after Bild and Paris Match published their reports. "That is something we did not know before. ... Overall we can say many things of the investigation weren't revealed by the investigation at the beginning," he said. What was mental state of Germanwings co-pilot? German airline Lufthansa confirmed Tuesday that co-pilot Andreas Lubitz had battled depression years before he took the controls of Germanwings Flight 9525, which he's accused of deliberately crashing last week in the French Alps. Lubitz told his Lufthansa flight training school in 2009 that he had a "previous episode of severe depression," the airline said Tuesday. Email correspondence between Lubitz and the school discovered in an internal investigation, Lufthansa said, included medical documents he submitted in connection with resuming his flight training. The announcement indicates that Lufthansa, the parent company of Germanwings, knew of Lubitz's battle with depression, allowed him to continue training and ultimately put him in the cockpit. 
Lufthansa, whose CEO Carsten Spohr previously said Lubitz was 100% fit to fly, described its statement Tuesday as a "swift and seamless clarification" and said it was sharing the information and documents -- including training and medical records -- with public prosecutors. Spohr traveled to the crash site Wednesday, where recovery teams have been working for the past week to recover human remains and plane debris scattered across a steep mountainside. He saw the crisis center set up in Seyne-les-Alpes, laid a wreath in the village of Le Vernet, closer to the crash site, where grieving families have left flowers at a simple stone memorial. Menichini told CNN late Tuesday that no visible human remains were left at the site but recovery teams would keep searching. French President Francois Hollande, speaking Tuesday, said that it should be possible to identify all the victims using DNA analysis by the end of the week, sooner than authorities had previously suggested. In the meantime, the recovery of the victims' personal belongings will start Wednesday, Menichini said. Among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board. Check out the latest from our correspondents . The details about Lubitz's correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and Lubitz's possible motive for downing the jet. A Lufthansa spokesperson told CNN on Tuesday that Lubitz had a valid medical certificate, had passed all his examinations and "held all the licenses required." Earlier, a spokesman for the prosecutor\'s office in Dusseldorf, Christoph Kumpa, said medical records reveal Lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot's license. Kumpa emphasized there's no evidence suggesting Lubitz was suicidal or acting aggressively before the crash. 
Investigators are looking into whether Lubitz feared his medical condition would cause him to lose his pilot's license, a European government official briefed on the investigation told CNN on Tuesday. While flying was "a big part of his life," the source said, it\'s only one theory being considered. Another source, a law enforcement official briefed on the investigation, also told CNN that authorities believe the primary motive for Lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems. Lubitz's girlfriend told investigators he had seen an eye doctor and a neuropsychologist, both of whom deemed him unfit to work recently and concluded he had psychological issues, the European government official said. But no matter what details emerge about his previous mental health struggles, there's more to the story, said Brian Russell, a forensic psychologist. "Psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they weren't going to keep doing their job and they're upset about that and so they're suicidal," he said. "But there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person's problems." Germanwings crash compensation: What we know . Who was the captain of Germanwings Flight 9525? CNN's Margot Haddad reported from Marseille and Pamela Brown from Dusseldorf, while Laura Smith-Spark wrote from London. CNN's Frederik Pleitgen, Pamela Boykoff, Antonia Mortensen, Sandrine Amiel and Anna-Maja Rappard contributed to this report."""
)


tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, attn_implementation=ATTN_IMPL
).to("cuda")
prompt_length = tokenizer([FRANCE_ARTICLE], return_tensors="pt").input_ids.shape[1]
label_ms_per_token = f"Throughput (time/forward pass, prompt = {prompt_length} tokens)"
label_first_step = f"First call (time, prompt = {prompt_length} tokens)"


def print_results(all_results):
    print("\n")
    compare = benchmark.Compare(all_results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()


def time_generate_call(model, task, ms_per_token, first_step, static_cache=False):
    for bsz in BSZ:
        for max_new_tokens in NEW_TOK:
            input_ids = tokenizer([FRANCE_ARTICLE] * bsz, return_tensors="pt").to("cuda")
            description = f"batch size, max_new_tokens: {bsz, max_new_tokens}"
            task_spec_ms_per_token = benchmark.TaskSpec(
                stmt="", setup="", description=task, label=label_ms_per_token, sub_label=description
            )
            task_spec_ms_first_step = benchmark.TaskSpec(
                stmt="", setup="", description=task, label=label_first_step, sub_label=description
            )

            # generate EXACTLY `max_new_tokens` tokens (no early termination due to `eos_token_id`)
            generation_kwargs = {
                "max_new_tokens": max_new_tokens,
                "min_new_tokens": max_new_tokens,
                "eos_token_id": None,
                "do_sample": False,
            }
            if static_cache:
                generation_kwargs["cache_implementation"] = "static"
            generation_config = copy.deepcopy(model.generation_config)
            generation_config.update(**generation_kwargs)

            past_key_values = None

            torch.compiler.reset()
            results = []
            for _ in range(N_ITER):
                start = torch.cuda.Event(enable_timing=True)
                end = torch.cuda.Event(enable_timing=True)
                start.record()
                gen_out = model.generate(
                    **input_ids, generation_config=generation_config, past_key_values=past_key_values
                )
                end.record()
                torch.cuda.synchronize()
                total_time = start.elapsed_time(end) / 1000  # time in seconds
                time_per_forward = total_time / max_new_tokens
                assert gen_out.shape[1] == max_new_tokens + prompt_length
                results.append(time_per_forward)

            if static_cache:
                del model._cache

            ms_per_token.append(benchmark.Measurement(1, results[3:], task_spec_ms_per_token, metadata=None))
            first_step.append(benchmark.Measurement(
                1, [results[0] * max_new_tokens], task_spec_ms_first_step, metadata=None)
            )
            print_results(ms_per_token)
            print_results(first_step)
            print("*" * 80)


ms_per_token = []
first_step = []

# dynamic
time_generate_call(model, "dynamic", ms_per_token, first_step)

# static
time_generate_call(model, "static", ms_per_token, first_step, static_cache=True)

# static + forward compiled
torch.compiler.reset()
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
time_generate_call(model, "static + fwd compiled", ms_per_token, first_step, static_cache=True)

# generate compiled
torch.compiler.reset()
model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True)
time_generate_call(model, "generate compiled", ms_per_token, first_step)

A100

RTX3090
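The benchmark script above derives its two metrics from the wall time of each `generate` call. Time per generated token is `total_time / max_new_tokens`, with the first three iterations dropped via `results[3:]`. First-call time is `results[0] * max_new_tokens`, which is simply the full duration of the first call, including compilation. The plain-Python sketch below reproduces that arithmetic so the numbers can be sanity-checked without `torch.utils.benchmark`. `summarize_generate_timings` is a hypothetical helper, and aggregating with the median is an assumption here, not necessarily what `benchmark.Compare` displays.

```python
import statistics


def summarize_generate_timings(total_times_s, max_new_tokens, n_warmup=3):
    """Reduce per-call `generate` wall times (seconds, in call order) to two numbers.

    Returns (median milliseconds per generated token after `n_warmup` calls,
    seconds taken by the first call). The first call includes compilation,
    which is why it is reported separately from the steady-state throughput.
    """
    if len(total_times_s) <= n_warmup:
        raise ValueError("need more timed calls than warmup calls")
    # same per-token normalization as `time_per_forward = total_time / max_new_tokens`
    per_token_ms = [t / max_new_tokens * 1000 for t in total_times_s]
    steady_state_ms = statistics.median(per_token_ms[n_warmup:])
    first_call_s = total_times_s[0]
    return steady_state_ms, first_call_s
```

For example, `summarize_generate_timings([12.0, 0.5, 0.4, 0.32, 0.34, 0.33], max_new_tokens=16)` reports roughly 20.6 ms per token and a 12-second first call.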

HuggingFaceDocBuilderDev · 2024-05-13T17:23:12Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ydshieh · 2024-05-14T10:12:07Z

This is great 🔥 🚀 . Please ping me once ready - I would like to run some benchmarks!

Also, it would be best to have an integration test for at least 2 models, probably llama and gemma.

ydshieh · 2024-05-14T10:13:18Z

I added [WIP] to the title but didn't change it to draft mode. Hope I am doing it right.

ydshieh · 2024-05-14T10:17:36Z

@gante Do you have the generation running time numbers of forward compile vs. end-to-end compile, say for llama?

gante · 2024-05-14T12:57:35Z

@ydshieh I'm not sure it will be faster, the biggest win is in terms of compatibility (compiled graphs are more portable and easily consumed by specialized hardware). I will run and share a few benchmarks in any case 🤗

ydshieh · 2024-05-14T13:26:41Z

Nice! If you are OK with that, maybe adopt the following small & simple script for a benchmark (even just as a first one).

import os
import torch
import datetime

from transformers import AutoTokenizer, AutoModelForCausalLM

token = "ADD_YOUR_OWN_TOKEN"

os.environ["TOKENIZERS_PARALLELISM"] = "false"

batch_size = 1
n_iter = 3

ckpt = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(ckpt, token=token)
model = AutoModelForCausalLM.from_pretrained(ckpt, token=token, torch_dtype=torch.float16).to("cuda")

model.generation_config.max_new_tokens = 16
model.generation_config.min_new_tokens = 16  # force exactly 16 new tokens (no early stop on EOS)

model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

input_text = "Why dogs are cute."
input_ids = tokenizer([input_text] * batch_size, return_tensors="pt").to("cuda")

for i in range(n_iter):
    s = datetime.datetime.now()
    outputs = model.generate(**input_ids, do_sample=False)
    torch.cuda.synchronize()  # make sure all queued GPU work is done before stopping the clock
    t = datetime.datetime.now()
    e = (t-s).total_seconds()
    print(e)
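CUDA kernels are launched asynchronously, so a wall-clock timer read right after `model.generate` may not capture all queued GPU work. Below is a small, hedged sketch of a timing loop that synchronizes before and after each call. `time_calls` is a hypothetical helper. On a GPU, `sync` would typically be `torch.cuda.synchronize`. It is passed in as a parameter so the helper itself only needs the standard library.

```python
import time


def time_calls(fn, n_iter=3, sync=None):
    """Return the wall-clock duration (seconds) of `n_iter` calls to `fn`.

    `sync` is an optional zero-argument callable (e.g. torch.cuda.synchronize)
    that is called before starting and before stopping the clock, so that
    asynchronously launched GPU work is included in the measurement.
    """
    timings = []
    for _ in range(n_iter):
        if sync is not None:
            sync()  # drain work queued before this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by this call
        timings.append(time.perf_counter() - start)
    return timings


# Usage with the script above (GPU assumed):
# for seconds in time_calls(lambda: model.generate(**input_ids, do_sample=False),
#                           n_iter=n_iter, sync=torch.cuda.synchronize):
#     print(seconds)
```

Using `time.perf_counter` rather than `datetime.now()` is a design choice: it is a monotonic, high-resolution clock intended for measuring intervals.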


gante · 2024-05-25T15:52:38Z

@ydshieh TL;DR not much faster, at least for now :) (benchmarks in the PR header)

zucchini-nlp

Very cool 🔥


gante · 2024-05-29T13:42:40Z

(seemingly unrelated CI failures)

ArthurZucker · 2024-06-05T12:13:39Z

I'll have a look!

ArthurZucker

🔥 Will this break compatibility for people who use torch.compile(model.forward)?

ArthurZucker · 2024-06-12T13:57:05Z

src/transformers/generation/utils.py

@@ -2520,6 +2541,7 @@ def _sample(

            unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores)
            this_peer_finished = unfinished_sequences.max() == 0
+            cur_len += 1


Kinda wondering if torch compile fares well with ints like this, but I guess yeah.

ArthurZucker · 2024-06-12T13:57:21Z

src/transformers/generation/utils.py

@@ -2500,6 +2520,7 @@ def _sample(
            # token selection
            if do_sample:
                probs = nn.functional.softmax(next_token_scores, dim=-1)
+                # TODO (joao): this OP throws "skipping cudagraphs due to ['incompatible ops']", find solution


maybe we can open an issue on torch directly!
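One possible direction for the `torch.multinomial` TODO is sketched below. It is untested with compilation and is not a known fix. The Gumbel-max trick samples from a categorical distribution using only elementwise ops and an `argmax`. `gumbel_max_sample` is a hypothetical helper that returns the same `(batch, 1)` shape as `torch.multinomial(probs, num_samples=1)`. Whether it avoids the "skipping cudagraphs due to ['incompatible ops']" message depends on how torch.compile handles random ops inside CUDA graphs, and that would need to be checked.

```python
import torch


def gumbel_max_sample(probs: torch.Tensor, generator=None) -> torch.Tensor:
    """Draw one token index per row of `probs` without torch.multinomial.

    With E ~ Exponential(1), G = -log(E) is Gumbel(0, 1) noise, so
    argmax(probs / E) == argmax(log(probs) + G), which picks index i with
    probability probs[i] / probs.sum() (the Gumbel-max trick).
    """
    # one independent Exponential(1) draw per vocabulary entry
    noise = torch.empty_like(probs).exponential_(1.0, generator=generator)
    # keepdim=True matches the (batch, 1) shape of torch.multinomial(probs, num_samples=1)
    return torch.argmax(probs / noise, dim=-1, keepdim=True)


# Usage: in place of `torch.multinomial(probs, num_samples=1).squeeze(1)`
# next_tokens = gumbel_max_sample(probs).squeeze(1)
```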


ydshieh · 2024-06-12T14:51:41Z

compilation takes a LOT of time. On both test devices, the process got killed when max_new_tokens=256

It might be related to pytorch/pytorch#128424. Although max_new_tokens=256 is quite short, much more is involved in this PR (the code from generate outside of forward).

Was the process killed because it ran out of memory? I also observed this and reported it on the PyTorch issue above.

Did you try compiling without reduce-overhead? I know it's not ideal, but it would show whether that works much better (apart from the ultra performance).

gante · 2024-07-09T12:20:38Z

(ready to be merged, needs CI fixing from external issues -- on it)

ArthurZucker

Very good, I just don't like the control flow on the modeling side: IMO we should update the modeling code, and prepare_inputs_for_generation should stop needing to get the seq len and the max length if possible (could they be inferred from the generation config?)
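Here is a minimal sketch of that suggestion. It assumes Hugging Face-style length semantics: `max_new_tokens` takes precedence when set, and `max_length` counts prompt plus generated tokens and defaults to 20. Under those semantics, the total length a static cache must hold can be computed once from the generation config before decoding, so per-step code would not need to query it. `GenLengths` and `resolve_max_cache_len` are hypothetical names, not transformers APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenLengths:
    """Hypothetical stand-in for the length fields of a generation config."""

    max_length: int = 20  # prompt + generated tokens
    max_new_tokens: Optional[int] = None  # generated tokens only; wins when set


def resolve_max_cache_len(prompt_length: int, cfg: GenLengths) -> int:
    """Total number of positions a static cache must hold, computed once before decoding."""
    if cfg.max_new_tokens is not None:
        return prompt_length + cfg.max_new_tokens
    return cfg.max_length


print(resolve_max_cache_len(10, GenLengths(max_new_tokens=16)))  # 26
```

This matches the benchmark's own check that `gen_out.shape[1] == max_new_tokens + prompt_length`.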


gante · 2024-07-27T16:23:39Z

Ran benchmarks again to double-check, the conclusions at the top persist:

(gemma 2b, measured on RTX4090)

commit 37c5ca5eb9012a1009cf23b892828902f6a8799a Author: Raushan Turganbay <[email protected]> Date: Tue Aug 6 10:24:19 2024 +0500 Cache: create docs (#32150) * draft * updates * works? * try adding python example in hidden section * another try * hwo do i render python * format as html code? * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante <[email protected]> * one more small update * should render hidden secrtion now * add outputs * fix links * check links * update all links * update with offloaded cache * all cache is importable, so they appear in docs * fix copies * docstring... --------- Co-authored-by: Joao Gante <[email protected]> commit 13dc6b0853c3cb54e79b18105c0528bc9e84881c Author: Francisco Kurucz <[email protected]> Date: Mon Aug 5 19:14:50 2024 -0300 Fix documentation links and code reference to model llava-next (#32434) commit 7e5d46ded433605a906fcab6be43ac85307cca9b Author: amyeroberts <[email protected]> Date: Mon Aug 5 16:33:19 2024 +0100 Respect the config's attn_implementation if set (#32383) * Respect the config's attn if set * Update test - can override in from_config * Fix commit 458b0cd2c544cdd6c700f9b0c21077c889bcee6c Author: Sai-Suraj-27 <[email protected]> Date: Mon Aug 5 19:49:42 2024 +0530 fix: Updated `test_embeded_special_tokens` for luke and mluke models (#32413) Fixed tokenizertests for luke, mluke models. 
commit baf7e5c927744122c89ab1270c6c312541c7eb41 Author: Abdi <[email protected]> Date: Mon Aug 5 21:15:36 2024 +0800 Persist embedding type of BART and mBART models after resize (#32242) * fix: persist embedding type of MBartConditonalGeneration after resize * fix: persist embedding type of BartConditonalGeneration after resize commit f5f1e52f6cf13cdf63ff25c311d33e2f2a842911 Author: Francisco Kurucz <[email protected]> Date: Mon Aug 5 05:18:28 2024 -0300 Fix documentation references to google/bit-50 model (#32407) commit ea5da52ebc062ff56f0e3aa05b0e3cc981731e14 Author: Nicholas Broad <[email protected]> Date: Mon Aug 5 00:51:58 2024 -0700 add values for neftune (#32399) I always forget what typical values are, and I have to look at the paper everytime. This will be a helpful reminder. commit 3d7c2f9dea45338b7ebcd459b452e2fad7abfa1f Author: Ita Zaporozhets <[email protected]> Date: Mon Aug 5 09:22:48 2024 +0200 * save total_vocab_size = vocab_size + user added tokens to speed up operation * updating length when added_tokens_decoder is set * add test len(tokenizer) commit 3bb646a54f42030e9bafa47cd3f64367691a3bc5 Author: Raushan Turganbay <[email protected]> Date: Mon Aug 5 11:58:42 2024 +0500 Phi3 tests: fix typing for Python 3.8 (#32388) fix phi commit 05ae3a300d6f3534eeb99a08828a5bae6dd973db Author: TechInterMezzo <[email protected]> Date: Mon Aug 5 08:40:58 2024 +0200 fix: SeamlessM4TFeatureExtractor stride remainder (#32088) * fix: SeamlessM4TFeatureExtractor stride remainder * Added attention mask size test * Reran ruff for style correction commit 847bb856d55e3664150e408448fa59d0705b4d60 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 5 08:38:34 2024 +0200 Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer (#32393) Bump keras in /examples/research_projects/decision_transformer Bumps [keras](https://github.com/keras-team/keras) from 2.8.0 to 2.13.1. 
- [Release notes](https://github.com/keras-team/keras/releases) - [Commits](https://github.com/keras-team/keras/compare/v2.8.0...v2.13.1) --- updated-dependencies: - dependency-name: keras dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 621fb3c0edddf98f3272f3b197e772af4fa30b6c Author: Xueshen Liu <[email protected]> Date: Sat Aug 3 14:07:55 2024 -0400 MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. (#31500) * Mixtral: remove unnecessary plus 1 when calculating rotary_seq_len, allowing position_ids=None (no auto position_ids generation could be unsafe) * fix typo [:-1] to [:, -1] * to meet formatting requirement * to meet formatting requirement * remove white space * MixtralFlashAttention2: put "+ 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. Fix format/style issue. * propagate to startcoder2, phi3, mixtral and qwen2 * update qwen2_moe commit 7c31d05b59a9dce24b8ddc4b2bb8c8cf6bb5fd77 Author: Shaopeng Fu <[email protected]> Date: Sat Aug 3 19:24:11 2024 +0300 fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. (#32157) fix: Exception raised when running . commit c1aa0edb48217f416f4bbe6e3a9db1500284513b Author: Sanchit Gandhi <[email protected]> Date: Fri Aug 2 17:32:50 2024 +0800 [generate] only require an attention mask for mps with torch<2.4 (#32367) * up * style * stopping commit 083e13b7c47f674b11c74d1b7c7ee7cd1241b406 Author: Joao Gante <[email protected]> Date: Fri Aug 2 09:39:45 2024 +0100 RoPE: Add numerical tests ✨ (#32380) tests! 
:D commit 2af199c42b545f6248475ce456dd6c2a351b8522 Author: Raushan Turganbay <[email protected]> Date: Fri Aug 2 09:54:16 2024 +0500 Update docs (#32368) nits commit 82efc53513a51660e629c7eca8210af1d67df00b Author: Zach Mueller <[email protected]> Date: Thu Aug 1 15:18:43 2024 -0400 Yell at the user if zero-3 init wasn't performed, but expected to have been done (#32299) * Test this zach * Test for improper init w/o zero3 * Move back * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * Get rid of stars in warning * Make private * Make clear --------- Co-authored-by: amyeroberts <[email protected]> commit 51ab25e2932da15511ced35bcbdfa92d25c4794c Author: OsamaS99 <[email protected]> Date: Thu Aug 1 14:57:42 2024 +0200 Fixed Hybrid Cache Shape Initialization. (#32163) * fixed hybrid cache init, added test * Fix Test Typo --------- Co-authored-by: Aaron Haag <[email protected]> commit e3d8285a84f803e962050e2c2283f3362e36bfbc Author: Joao Gante <[email protected]> Date: Thu Aug 1 13:46:11 2024 +0100 Docker: add `speech` dep to the consistency docker image (#32374) commit ca59d6f77c9fda197222f9aa9205d8c7b5dff34e Author: Nikos Karampatziakis <[email protected]> Date: Thu Aug 1 05:42:07 2024 -0700 Offloaded KV Cache (#31325) * Initial implementation of OffloadedCache * enable usage via cache_implementation * Address feedback, add tests, remove legacy methods. 
* Remove flash-attn, discover synchronization bugs, fix bugs * Prevent usage in CPU only mode * Add a section about offloaded KV cache to the docs * Fix typos in docs * Clarifications and better explanation of streams commit b4727a1216bb21df2795e973063ed07202235d7e Author: Omar Salman <[email protected]> Date: Thu Aug 1 17:32:13 2024 +0500 Fix conflicting key in init kwargs in PreTrainedTokenizerBase (#31233) * Fix conflicting key in init kwargs in PreTrainedTokenizerBase * Update code to check for callable key in save_pretrained * Apply PR suggestions * Invoke CI * Updates based on PR suggestion commit db8c7caeb6b3969a2153b36ba3e5fdef6534c1d6 Author: Viktor Scherbakov <[email protected]> Date: Thu Aug 1 14:30:10 2024 +0200 Empty list in defaults for LLaMA special tokens during weights conversion (#32342) empty list in defaults commit 2229ebe7220fb54bc5f91f575c2d7a988e7122cb Author: Ita Zaporozhets <[email protected]> Date: Thu Aug 1 13:57:41 2024 +0200 update clean_up_tokenization_spaces warning (#32371) commit 05c1f9af9a5ebd213dd923e97f6fbed4c115f3c6 Author: Hanna Yukhymenko <[email protected]> Date: Thu Aug 1 13:52:05 2024 +0200 Check device map for saving tokenizer config on TPU (fix for issue #31971) (#32043) * Remove TPU device map for saving tokenizer config * Update tokenization_utils_base.py * Fix error msg when passing non-string device into tokenizer * Fix error message for non-string tokenizer device * Print out tokenizer device type in error msg * Update tokenization_utils_base.py commit 9e2828403218da16d9759c9be020b70f51df373d Author: nv-guomingz <[email protected]> Date: Thu Aug 1 19:51:20 2024 +0800 add missing attribute _supports_param_buffer_assignment for gpt-j. 
(#32359) Co-authored-by: Guoming Zhang <[email protected]> commit 48ed24c50ab29bf690f2ab030721e6a8b0aa5205 Author: Lunwen He <[email protected]> Date: Thu Aug 1 04:49:00 2024 -0700 Remove size check between attn_weights and kv_seq_len for phi3 (#32339) * Remove size check between attn_weights and kv_seq_len * add unit tests commit e234061cddd28bb8b82144833241883816289e40 Author: Sanchit Gandhi <[email protected]> Date: Thu Aug 1 18:10:56 2024 +0800 [whisper] compile compatibility with long-form decoding (#31772) * [whisper] compile compatibility with long-form decoding * clarify comment * fix after rebase * finalise * fix bsz * fix cache split * remove contiguous * style * finish * update doc * prevent cuda graph trace commit 9451a385261b30e7319a2c93285ab76161e8c003 Author: Sanchit Gandhi <[email protected]> Date: Thu Aug 1 16:05:27 2024 +0800 [enc-dec cache] fix bug in indexing (#32370) commit 453e74884fb7e2613e7b45033fbb3c1cadb638b4 Author: Raushan Turganbay <[email protected]> Date: Thu Aug 1 09:48:03 2024 +0500 LLaVa: add cache class attribute (#32278) cache class flag commit 14ee2326e51cb210cec72f31b248cb722e9d5d1f Author: Ricardo <[email protected]> Date: Thu Aug 1 06:34:22 2024 +0800 fix: warmup_steps check for training_args (#32236) commit 53f0c9c2906e0b0f1623bfdfb420fca1e655098d Author: Sai-Suraj-27 <[email protected]> Date: Thu Aug 1 01:26:50 2024 +0530 fix: Removed unnecessary `@staticmethod` decorator (#32361) * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. commit 92abe6033491dcaa958235e551f40f6b417d3771 Author: fxmarty <[email protected]> Date: Wed Jul 31 20:03:07 2024 +0200 >3-5x faster torch.compile forward compilation for autoregressive decoder models (#32227) * draft * apply changes to all relevant archs * rerun ci - check_docstrings.py failing? 
* fix docstring * move 2D->4D mask creation to modeling file * repo consistency * fix the batch size = 1 case - calling contiguous is not enough * nit * style * propagate to gemma/gemma-2 * prepare inputs for gemma generation * implement test and tiny fix in gemma2 * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: Arthur <[email protected]> * fix copies * ci pass * fix gemma's test_compile_static_cache tests * flacky * retrigger ci --------- Co-authored-by: sanchit-gandhi <[email protected]> Co-authored-by: Arthur <[email protected]> commit b46bd8b9d2ac991c0c04674957ebc0a65fb3f42b Author: Aymeric Roucher <[email protected]> Date: Wed Jul 31 18:44:53 2024 +0200 Fix error when streaming to gradio with non-string tool arguments (#32360) Fix error when streaming agent run to gradio with non-string tool arguments commit ef177a5e1cdf0ca53e24e6d76e813198f7300dc4 Author: Joao Gante <[email protected]> Date: Wed Jul 31 16:04:48 2024 +0100 Gemma 2: support assisted generation (#32357) commit 5f1fcc299cb00c1edce5eb1efb8bacdde2365690 Author: amyeroberts <[email protected]> Date: Wed Jul 31 14:51:04 2024 +0100 [Idefics2] - Fix FA2 call for Perceiver layer (#32275) * Fix FA2 call for Perciever layer * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 * Fix up * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 commit b75ad56620431984a44a962c98136c8571b4fca9 Author: Joao Gante <[email protected]> Date: Wed Jul 31 11:12:46 2024 +0100 Llama 3.1: Fix incorrect `inv_freq` assignment (#32330) fix 💩 commit 7f552e28e0aca00ce60868c7620f7463eab60e14 Author: Raushan Turganbay <[email protected]> Date: Wed Jul 31 10:33:38 2024 +0500 Gemma2 and flash-attention (#32188) * enable flash-attn & static cache * this works, not the prev * fix for sliding window layers * not needed anymore commit a3264332cfb5ab8675ddb42740a75aeee1782a74 Author: Raushan Turganbay <[email protected]> Date: Wed Jul 31 10:01:12 2024 +0500 LLaVA-NeXT: fix anyres shapes 
(#32314) fix commit 6e2d04e429dc4ce240c99bd14b7b84550b79fd73 Author: Joshua Lochner <[email protected]> Date: Tue Jul 30 23:36:38 2024 +0200 Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process (#32191) * Remove user-defined tokens which can be obtained through merges * Remove debug line * formatting * Refactor spm slow -> fast converter * revert unnecessary refactor * set comprehension * remove test files * Use `vocab_scores` * Always replace spiece underline with space in decode * we no longer need token filtering * Add save fast load slow unit test * Remove tokenizers version check * Remove duplicate code * Make `<start_of_turn>` and `<end_of_turn>` special tokens * Bias merge priority with length if score is the same * Add unit test for merge priority * CI commit 026a173a64372e9602a16523b8fae9de4b0ff428 Author: Joao Gante <[email protected]> Date: Tue Jul 30 18:56:10 2024 +0100 Repo checks: skip docstring checks if not in the diff (#32328) * tmp * skip files not in the diff * use git.Repo instead of an external subprocess * add tiny change to confirm that the diff is working on pushed changes * add make quality task * more profesh main commit reference commit 516af4bb63538edc448f814e3690dd5171c4f311 Author: fkrasnov2 <[email protected]> Date: Tue Jul 30 20:21:45 2024 +0300 fixes #32329 : The Torch code is correct - to get an average of 10% o… (#32335) fixes #32329 : The Torch code is correct - to get an average of 10% of the total, we want to take 50% of the remainder after we've already masked 80% with [MASK] in the previous step. 
commit 62c60a30181a65e1a3a7f19c3055a240a6a21335
Author: Wing Lian <[email protected]>
Date:   Tue Jul 30 12:55:59 2024 -0400

    fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276)

commit 16271080333ad52be5349fb31d789fb232b68760
Author: Sai-Suraj-27 <[email protected]>
Date:   Tue Jul 30 22:23:03 2024 +0530

    fix: Added missing raise keyword for few exceptions (#32333)

    Fixed raising of few exceptions.

commit bd54ed2ed7f578e4122f3e6d536fbe3c9bc76de1
Author: plaggy <[email protected]>
Date:   Tue Jul 30 18:48:18 2024 +0200

    Alternative agent plan (#32295)

    * new agent plan
    * plan type assertion
    * style corrections
    * better prompt naming
    * make fixup

commit e68ec18ce224af879f22d904c7505a765fb77de3
Author: Joao Gante <[email protected]>
Date:   Tue Jul 30 15:49:14 2024 +0100

    Docs: formatting nits (#32247)

    * doc formatting nits
    * ignore non-autodocs
    * Apply suggestions from code review
      Co-authored-by: amyeroberts <[email protected]>
    * Update src/transformers/models/esm/modeling_esm.py
      Co-authored-by: amyeroberts <[email protected]>
    * Update src/transformers/models/esm/modeling_esm.py
      Co-authored-by: amyeroberts <[email protected]>
    * make fixup
    ---------
    Co-authored-by: amyeroberts <[email protected]>

commit 2fbbcf5007509c66b02924ce6dcff66f58e7f58c
Author: Yoach Lacombe <[email protected]>
Date:   Tue Jul 30 16:00:13 2024 +0200

    Fix M4T for ASR pipeline (#32296)

    * tentative fix
    * do the same for M4T

commit 084b5094eb490319719cc11cb05b751e0b419d49
Author: Luc Georges <[email protected]>
Date:   Tue Jul 30 14:49:26 2024 +0200

    feat(ci): set `fetch-depth: 0` in trufflehog checkout step (#31663)

commit 20528f067cf9204cea5178ce0f837245e146e159
Author: Teddy Ferdinan <[email protected]>
Date:   Tue Jul 30 11:25:54 2024 +0200

    Cast epochs_trained to int when resuming training (#32286)

    * fix epochs_trained as int when resuming training
    * refactor
    ---------
    Co-authored-by: teddyferdinan <[email protected]>

commit 934fe1504e6d5e87e01d96305f4d97faa63cf4c1
Author: Isotr0py <[email protected]>
Date:   Tue Jul 30 17:01:00 2024 +0800

    Fix GGUF dequantize for `gguf==0.9.1` (#32298)

    * fix gguf dequantize for gguf==0.9.1
    * fix old version
    * make style

commit 3e8106d2533cbd890ddd1e919bd62132cd4718c3
Author: Gilad Turok <[email protected]>
Date:   Tue Jul 30 03:19:24 2024 -0400

    Docs: fix GaLore optimizer code example (#32249)

    Docs: fix GaLore optimizer example

    Fix incorrect usage of GaLore optimizer in Transformers trainer code example.

    The GaLore optimizer uses low-rank gradient updates to reduce memory usage. GaLore is quite popular and is implemented by the authors in https://github.com/jiaweizzhao/GaLore. A few months ago GaLore was added to the HuggingFace Transformers library in https://github.com/huggingface/transformers/pull/29588.

    Documentation of the Trainer module includes a few code examples of how to use GaLore. However, the `optim_targe_modules` argument to the `TrainingArguments` function is incorrect, as discussed in https://github.com/huggingface/transformers/pull/29588#issuecomment-2006289512. This pull request fixes this issue.

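Commit 1627108 (#32333) above adds the missing `raise` to exceptions that were being constructed but never raised. A minimal sketch of why that bug goes unnoticed; `check_positive_buggy` and `check_positive` are hypothetical names, not code from the library:

```python
def check_positive_buggy(value):
    if value <= 0:
        # Bug: the exception object is built and then discarded; nothing is raised
        ValueError(f"expected a positive value, got {value}")
    return value


def check_positive(value):
    if value <= 0:
        raise ValueError(f"expected a positive value, got {value}")
    return value


print(check_positive_buggy(-1))  # -1: the invalid input slips through silently
try:
    check_positive(-1)
except ValueError as err:
    print("caught:", err)  # caught: expected a positive value, got -1
```

Because a bare `ValueError(...)` is a valid expression statement, the interpreter accepts it without complaint; only a linter or a test that expects the exception will catch the omission.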
commit f0bc49e7f61f74f055c47ad40e6010f57eed0b0b
Author: Yih-Dar <[email protected]>
Date:   Mon Jul 29 22:12:21 2024 +0200

    use torch 2.4 in 2 CI jobs (#32302)

    Co-authored-by: ydshieh <[email protected]>

commit a24a9a66f446dcb9277e31d16255536c5ce27aa6
Author: Aymeric Roucher <[email protected]>
Date:   Mon Jul 29 20:12:44 2024 +0200

    Add stream messages from agent run for gradio chatbot (#32142)

    * Add stream_to_gradio method for running agent in gradio demo

commit 811a9caa2141bc98f96b36c69abcf1f934bd1fd2
Author: Guang Yang <[email protected]>
Date:   Mon Jul 29 10:19:15 2024 -0700

    Make static cache compatible with torch.export (#32168)

commit 7f5d644e69068825bb5b6e84cdc56b3d3a9bd04f
Author: Sanchit Gandhi <[email protected]>
Date:   Mon Jul 29 21:24:42 2024 +0800

    [pipeline] fix padding for 1-d tensors (#31776)

    * [pipeline] fix padding for 1-d tensors
    * add test
    * make style
    * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py
      Co-authored-by: Kamil Akesbi <[email protected]>
    * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py
    ---------
    Co-authored-by: Kamil Akesbi <[email protected]>

commit 3fbaaaa64d1ef3d8327adb577994d3d11277c77a
Author: Kamil Akesbi <[email protected]>
Date:   Mon Jul 29 11:19:52 2024 +0100

    Whisper tokenizer word level timestamps (#32197)

    * fix _fix_key in PreTrainedModel
    * fix _find_longest_common_sequence
    * add test
    * remove result.json
    * nit
    * update test

commit 7ffe25f2b935dcaf65079b04c5f91c8a42a99e28
Author: Joao Gante <[email protected]>
Date:   Mon Jul 29 10:52:13 2024 +0100

    Generate: end-to-end compilation (#30788)

    * mvp
    * added test (a few models need fixes)
    * fix a few test cases
    * test nits
    * harder test 😈
    * revert changes in stablelm
    * test with improved condition
    * add todo
    * tmp commit
    * merged with main
    * nits
    * add todo
    * final corrections
    * add docs for generation compilation
    * docs nits
    * add tip
    * PR suggestions
    * add more details to the compilation docs
    * fix cache positions
    * cache is now init in generate; update docs
    * tag test as flaky
    * docs
    * post rebase make fixup and other nits
    * remove unintended changes
    * whisper (encoder-decoder) not supported
    * move token default updates to ; add tests for token defaults
    * push changes
    * manual rebase
    * chameleon doesn't support this
    * fix test_static_cache_mha_mqa_gqa (broken in another PR)
    * docs: dynamic is better with end-to-end compilation

commit 49928892d6491ff5a49c12cbc34695f6fa7ac0ed
Author: Sai-Suraj-27 <[email protected]>
Date:   Mon Jul 29 15:20:43 2024 +0530

    fix(docs): Fixed a link in docs (#32274)

    Fixed a link in docs.

commit 6494479f1de9fe16e9c6f89e52eb0cf81f864a7c
Author: Fanli Lin <[email protected]>
Date:   Mon Jul 29 17:29:11 2024 +0800

    make `p_mask` a numpy array before passing to `select_starts_ends` (#32076)

    * fix
    * bug fix
    * refine
    * fix

commit 535fe78b9f1d148684723e51f00645351880c47a
Author: Joao Gante <[email protected]>
Date:   Mon Jul 29 10:06:05 2024 +0100

    Repo: remove exceptions in `check_docstrings` (#32259)

    remove exceptions

commit a2ad9d5ad53f68c1ad268f7f46538eac6f5b631b
Author: Sai-Suraj-27 <[email protected]>
Date:   Mon Jul 29 14:13:09 2024 +0530

    fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call (#32262)

    Removed one wrong argument passed to convert_blip_checkpoint function call.

commit 5019aabfacf7599b9a6b4e7a1adc1fb5c9017727
Author: leejet <[email protected]>
Date:   Mon Jul 29 15:51:43 2024 +0800

    Optimize t5 tokenize logic to avoid redundant calls (#32270)

    * Optimize t5 tokenize logic to avoid redundant calls
    * fix and overwrite copies

commit f2122cc6eb8e50e4d1b45da54b43bba59a458b30
Author: Yih-Dar <[email protected]>
Date:   Mon Jul 29 09:42:54 2024 +0200

    Upload new model failure report to Hub (#32264)

    upload

    Co-authored-by: ydshieh <[email protected]>

commit f7396876849926afa87c9412d67c43618dad403d
Author: Raushan Turganbay <[email protected]>
Date:   Mon Jul 29 10:58:59 2024 +0500

    🚨 Bloom support for cache class (#31445)

    * bloom dynamic cache
    * bloom follows standard cache format
    * no skips for bloom anymore
    * use cache position when possible
    * clean up
    * codestyle
    * Update src/transformers/models/bloom/modeling_bloom.py
      Co-authored-by: amyeroberts <[email protected]>
    * Update src/transformers/models/bloom/modeling_bloom.py
      Co-authored-by: amyeroberts <[email protected]>
    * Update src/transformers/models/bloom/modeling_bloom.py
      Co-authored-by: amyeroberts <[email protected]>
    * pr comments
    * isinstance fix
    * address comments
    * make musicgen test happy
    * [run-slow] bloom
    ---------
    Co-authored-by: amyeroberts <[email protected]>

commit 44f6fdd74f84744b159fa919474fd3108311a906
Author: Joao Gante <[email protected]>
Date:   Sat Jul 27 10:19:46 2024 +0100

    Llama 3.1: replace for loop by tensor ops at inv_freq initialization (#32244)

    * replace for loop by tensor ops
    * rm assert; readability

commit 8da90687308a10b33c5553b8a506cc04aab31702
Author: Yih-Dar <[email protected]>
Date:   Fri Jul 26 20:52:45 2024 +0200

    More flexible trigger condition (#32251)

    update

    Co-authored-by: ydshieh <[email protected]>

commit 81233c069c166af033794134bd8888783ac49ebe
Author: Raushan Turganbay <[email protected]>
Date:   Fri Jul 26 14:45:55 2024 +0500

    Flash-Attn: fix generation when no attention mask or no pading (#32241)

    * fix
    * fix prev test (half of failures)
    * [run-slow] llama, gemma2
    * [run-slow] llama, gemma2

commit 27c7f971c0dcd3bb423ea221fe2bce751d313119
Author: Fanli Lin <[email protected]>
Date:   Fri Jul 26 17:41:27 2024 +0800

    [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` (#32039)

    * add flash attention check
    * fix
    * fix

commit 5f841c74b62754f186a8c06a684d491524b7bc03
Author: Connor Anderson <[email protected]>
Date:   Fri Jul 26 05:05:46 2024 -0400

    Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 (#31934)

    * Add check for target_sizes is None in post_process_image_guided_detection
    * Make sure Owlvit and Owlv2 in sync
    * Fix incorrect indentation; add check for correct size of target_sizes

commit f9756d9edb23354e3df50f7eb3f6b3129a25e453
Author: Rohit Dwivedula <[email protected]>
Date:   Fri Jul 26 04:05:38 2024 -0500

    Adds: extra_repr for RMSNorm layers in most models (#32204)

    * adds: extra_repr() to RMSNorm layers in multiple models
    * adds: extra_repr for deprecated models as well
    * formatting as per style guide

commit b8e5cd5396f7c0cc2d5e10be6696ea38742abf51
Author: Sai-Suraj-27 <[email protected]>
Date:   Fri Jul 26 14:03:02 2024 +0530

    Refactor: Removed un-necessary `object` base class (#32230)

    * Refactored to remove un-necessary object base class.
    * small fix.

commit 1c7ebf1d6eaf0ed0fd4101fd6eb7e64601429cfe
Author: João Nadkarni <[email protected]>
Date:   Fri Jul 26 09:38:59 2024 +0200

    don't log base model architecture in wandb if log model is false (#32143)

    * don't log base model architecture in wandb is log model is false
    * Update src/transformers/integrations/integration_utils.py
      Co-authored-by: amyeroberts <[email protected]>
    * convert log model setting into an enum
    * fix formatting
    ---------
    Co-authored-by: amyeroberts <[email protected]>

commit c46edfb8230bcc3152e8338742dc4822289acb3d
Author: Raushan Turganbay <[email protected]>
Date:   Fri Jul 26 10:52:06 2024 +0500

    Resize embeds with DeepSpeed (#32214)

    * fix resize when deepspeed
    * deepsped uses new embeds
    * we needed this

commit fad15fba78e4603cd20695757ad899a6687485f9
Author: Raushan Turganbay <[email protected]>
Date:   Fri Jul 26 10:17:27 2024 +0500

    Llava: generate without images (#32183)

    * llava w/o images
    * tests

commit 4ab33c2d81866d4dd2f29df07f1a35491acbb39b
Author: Raushan Turganbay <[email protected]>
Date:   Fri Jul 26 10:16:06 2024 +0500

    Generation: stop at `eos` for assisted decoding (#31301)

    * fix
    * move changes to prompt lookup
    * add test
    * set eos in assistant model
    * style
    * fix flakiness
    * changes for new `main`
    * Update tests/generation/test_utils.py
      Co-authored-by: amyeroberts <[email protected]>
    * Update tests/generation/test_utils.py
      Co-authored-by: amyeroberts <[email protected]>
    * add comment to explain
    ---------
    Co-authored-by: amyeroberts <[email protected]>

commit 9d6c0641c4a3c2c5ecf4d49d7609edd5b745d9bc
Author: Pavel Iakubovskii <[email protected]>
Date:   Thu Jul 25 19:20:47 2024 +0100

    Fix code snippet for Grounding DINO (#32229)

    Fix code snippet for grounding-dino

commit 3a83ec48a63a8298c8193be48cf00785674bfb70
Author: jrhe <[email protected]>
Date:   Thu Jul 25 17:16:13 2024 +0100

    Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac (#31846)

    * use currently active microphone on mac for ffmpeg_microphone
    * Allow ffmpeg_microphone device to be specified
      Co-authored-by: amyeroberts <[email protected]>
    ---------
    Co-authored-by: amyeroberts <[email protected]>

commit 6ed0bf1e8543a7d8e6640bbf9a655c5e1401f7de
Author: Huazhong Ji <[email protected]>
Date:   Fri Jul 26 00:01:06 2024 +0800

    translate philosophy.md to chinese (#32177)

    * translate philosophy.md to chinese
    * add the missing link

commit df6eee9201e4ba2b80cea021a18e95ada26ca2cc
Author: Yih-Dar <[email protected]>
Date:   Thu Jul 25 16:12:23 2024 +0200

    Follow up for #31973 (#32025)

    * fix
    * [test_all] trigger full CI
    ---------
    Co-authored-by: ydshieh <[email protected]>

commit de2318894e4f971ea2273c653a702dc93db2bd6a
Author: Kashif Rasul <[email protected]>
Date:   Thu Jul 25 15:12:23 2024 +0200

    [warnings] fix E721 warnings (#32223)

    fix E721 warnings

commit 9b9a54e61bf8749588178b37c23d77b90679fd10
Author: Kashif Rasul <[email protected]>
Date:   Thu Jul 25 15:11:43 2024 +0200

    [BigBird Pegasus] set _supports_param_buffer_assignment to False (#32222)

    set _supports_param_buffer_assignment to False

commit 1ecedf1d9ee927bac5b5bae8cb1892d936a5b622
Author: Austin <[email protected]>
Date:   Thu Jul 25 07:20:27 2024 -0500

    Update question_answering.py (#32208)

commit f53a5dec7b03eb195dc89c82ae761b033db1ceb6
Author: Huazhong Ji <[email protected]>
Date:   Thu Jul 25 17:04:04 2024 +0800

    remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 (#32210)

    remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0

commit 5658e749adbaaf883caec003cecae8ce0a4261a6
Author: Sanchit Gandhi <[email protected]>
Date:   Thu Jul 25 16:58:02 2024 +0800

    [whisper] fix short-form output type (#32178)

    * [whisper] fix short-form output type
    * add test
    * make style
    * update long-form tests
    * fixes
    * last fix
    * finalise test

commit 85a1269e19af022e04bc2aad82572cd5a9e8cdd9
Author: Sai-Suraj-27 <[email protected]>
Date:   Wed Jul 24 22:30:21 2024 +0530

    fix: Replaced deprecated `unittest method` with the correct one (#32198)

    Replaced deprecated unittest method with the correct one.

commit edd68f4ed8db241bd3e9dc6c4ed96d471f243c9a
Author: Matt <[email protected]>
Date:   Wed Jul 24 17:36:32 2024 +0100

    :rotating_light: No more default chat templates (#31733)

    * No more default chat templates
    * Add the template to the GPT-SW3 tests since it's not available by default now
    * Fix GPT2 test
    * Fix Bloom test
    * Fix Bloom test
    * Remove default templates again

commit 1c122a46dc3c4448901f8d2f3018d9d58b846ba5
Author: Penut Chen <[email protected]>
Date:   Wed Jul 24 23:59:59 2024 +0800

    Support dequantizing GGUF FP16 format (#31783)

    * support gguf fp16
    * support gguf bf16 with pytorch
    * add gguf f16 test
    * remove bf16

commit af0e4b7b37b2d7eefe7531cf5201a5d6bae85525
Author: Marc Sun <[email protected]>
Date:   Wed Jul 24 17:14:05 2024 +0200

    Fix float8_e4m3fn in modeling_utils (#32193)

    * Fix float8_e4m3fn in modeling_utils
    * style
    * fix
    * comment

commit 1392a6867f40a55dfabaf306745c67627598b1af
Author: Raushan Turganbay <[email protected]>
Date:   Wed Jul 24 19:26:20 2024 +0500

    Fix resize embedding with Deepspeed (#32192)

    fix resize when deepspeed

commit 8d2534c4d0ab94a97a72d2ce6bb9ccd201abadb3
Author: Arthur <[email protected]>
Date:   Wed Jul 24 16:06:39 2024 +0200

    let's not warn when someone is running a forward (#32176)

    * let's not warn when someone is running a foward without cache + self.training
    * more models
    * fixup

commit e0182f3bd7f4753c1e378e052ceea67898d97359
Author: Joao Gante <[email protected]>
Date:   Wed Jul 24 15:00:48 2024 +0100

    RoPE: relaxed rope validation (#32182)

    * relaxed rope check
    * lets also accept rope_type=None, defaulting to the original implementation
    * type and rope_type can coexist

commit 165116bc145dcc186fa287e624b28a9ab3a79955
Author: amyeroberts <[email protected]>
Date:   Wed Jul 24 14:03:40 2024 +0100

    Remove conversational pipeline tests (#32099)

    Remove conversation pipeline tests

commit 5f4ee98a7ade33e1c54fdd6181d04ee7b426b392
Author: Dr. Artificial曾小健 <[email protected]>
Date:   Wed Jul 24 18:54:41 2024 +0800

    Update qwen2.md (#32108)

    * Update qwen2.md
      outdated description
    * Update qwen2.md
      amended
    * Update qwen2.md
      Update
    * Update qwen2.md
      fix wrong version code, now good to go

commit 8678879f1dc2578cec18232146bf19de97aecaa1
Author: 조준래 <[email protected]>
Date:   Wed Jul 24 19:38:49 2024 +0900

    fix: default value reflects the runtime environment variables rather than the ones present at import time. (#32153)

    * fix: default value reflects the runtime environment variables rather than the ones present at import time.
    * Fix: Change `deterministic` to None by default; use env var if None

commit 01be5b48790f113b7d71943b580c842e3e097988
Author: Rohit Dwivedula <[email protected]>
Date:   Wed Jul 24 02:09:59 2024 -0500

    adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer (#32171)

    * adds: extra_repr() to MambaRMSNorm to include the hidden size of the layer
    * style fix with ruff:

commit c85510f958e6955d88ea1bafb4f320074bfbd0c1
Author: Fanli Lin <[email protected]>
Date:   Wed Jul 24 00:47:51 2024 +0800

    [docs] change temperature to a positive value (#32077)

    fix

commit bc2adb0112b6677b0dfb4105c74570a0f92183eb
Author: Sai-Suraj-27 <[email protected]>
Date:   Tue Jul 23 21:22:41 2024 +0530

    fix: Fixed an if condition that is always evaluating to true (#32160)

    Fixed an if condition always evaluating to true.

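Commit 8678879 (#32153) above defaults `deterministic` to `None` and reads the environment variable only when it is `None`, so the value reflects the environment at call time rather than at import time. A minimal sketch of why that matters, using hypothetical function names and a hypothetical variable `EXAMPLE_DETERMINISTIC` (the commit message does not name the real one):

```python
import os

# Start from a clean environment so the demo is reproducible
os.environ.pop("EXAMPLE_DETERMINISTIC", None)


# The default is evaluated once, when this `def` statement runs
# (typically at import time), so later environment changes are ignored.
def attention_import_time(
    deterministic=os.environ.get("EXAMPLE_DETERMINISTIC", "0") == "1",
):
    return deterministic


# Using None as a sentinel defers the environment lookup to each call.
def attention_call_time(deterministic=None):
    if deterministic is None:
        deterministic = os.environ.get("EXAMPLE_DETERMINISTIC", "0") == "1"
    return deterministic


os.environ["EXAMPLE_DETERMINISTIC"] = "1"  # set *after* both functions were defined
print(attention_import_time())  # False: stale value captured at definition time
print(attention_call_time())    # True: reads the current environment
```

An explicit argument still wins in the sentinel version, so callers can override the environment variable when they need to.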
commit 23f6a43f82fb2980f4b30cf3f95eb3a940384895
Author: Joao Gante <[email protected]>
Date:   Tue Jul 23 16:48:16 2024 +0100

    fix (#32162)

commit d5a99dfcee6e94065cb7c83cc8ab6fc5daa0cc4e
Author: Lysandre <[email protected]>
Date:   Tue Jul 23 16:58:17 2024 +0200

    Llama 3.1 conversion

    Co-authored-by: Arthur Zucker <[email protected]>

commit ff0d708fe627d6715f9a3e97d0a7947f70437447
Author: Lysandre <[email protected]>
Date:   Tue Jul 23 17:12:47 2024 +0200

    Dev version: v4.44.0.dev0

commit d2c687b3f1859b5c61258af14abba5312c0e6201
Author: Sai-Suraj-27 <[email protected]>
Date:   Tue Jul 23 20:37:31 2024 +0530

    Updated `ruff` to the latest version (#31926)

    * Updated ruff version and fixed the required code accorindg to the latest version.
    * Updated ruff version and fixed the required code accorindg to the latest version.
    * Added noqa directive to ignore 1 error shown by ruff

commit 9cf4f2aa9a9cecbb22e813931ef3bb72fc773540
Author: RhuiDih <[email protected]>
Date:   Tue Jul 23 21:56:41 2024 +0800

    Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629)

    * add DataCollatorBatchFlattening
    * Update data_collator.py
    * change name
    * new FA2 flow if position_ids is provided
    * add comments
    * minor fix
    * minor fix data collator
    * add test cases for models
    * add test case for data collator
    * remove extra code
    * formating for ruff check and check_repo.py
    * ruff format ruff format tests src utils
    * custom_init_isort.py

commit 7d92009af647167bae338e9d4af8bc0452c62fbf
Author: Deep Gandhi <[email protected]>
Date:   Tue Jul 23 19:11:52 2024 +0530

    Added additional kwarg for successful running of optuna hyperparameter search (#31924)

    Update integration_utils.py

    Added additional kwarg

commit 63700628adb91600c84fe3bbbc4c667cd3e3aa71
Author: Alvaro Moran <[email protected]>
Date:   Tue Jul 23 14:18:19 2024 +0200

    feat(cache): StaticCache uses index_copy_ to avoid useless copy (#31857)

    * feat(cache): StaticCache uses index_copy_ to avoid useless copy
      Using index_copy_ allows for explicit in-place change of the tensor. Some backends (XLA) will otherwise copy the tensor, making the code slower and using more memory.
      Proposed implementation will end up using less memory and on XLA will result in less compilation, but the change is also quite generic, making no change whatsoever on CUDA or CPU backend.
    * feat(cache): SlidingWindowCache uses index_copy_ to avoid useless copy
      Applying the same change done in StaticCache.
    * fix(cache): fallback of index_copy_ when not implemented
    * fix(cache): in index_copy_ ensure tensors are on same device
    * [run slow] llama
    * fix(cache): add move of cache_position to same device in SlidingWindowCache
    * Revert "[run slow] llama"
      This reverts commit 02608dd14253ccd464e31c108e0cd94364f0e8b9.

commit a009fbdab32a4b068c24052a4dfe7a7bc0fc89f9
Author: amyeroberts <[email protected]>
Date:   Tue Jul 23 12:23:34 2024 +0100

    Fix typing to be compatible with later py versions (#32155)

commit 3263b3435473cbb5dc66925bc29c1d32b5b8d431
Author: Sanchit Gandhi <[email protected]>
Date:   Tue Jul 23 18:34:30 2024 +0800

    Revert "Incorrect Whisper long-form decoding timestamps " (#32148)

    Revert "Incorrect Whisper long-form decoding timestamps (#32003)"

    This reverts commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8.

commit 034b47784765e37ecc20f7ad43640f1a2c0094fd
Author: Amit Garg <[email protected]>
Date:   Tue Jul 23 03:33:22 2024 -0700

    Rename Phi-3 rope scaling type (#31436)

    * renamed phi3 rope_scaling type
    * fixed trailing whitespaces
    * fixed test
    * added warning
    * fixed format

commit bab32d6fe932a3372fbd6d5a84e3cacb12a61ae0
Author: Alexandre TL <[email protected]>
Date:   Tue Jul 23 12:32:19 2024 +0200

    Added mamba.py backend (#30139)

    * Update README.md
    * tests: forward ok
    * backward test done
    * done testing
    * removed check. scripts
    * Update README.md
    * added use_mambapy arg
    * fixed typo in warning
    * protected imports w/ mambapy package
    * delete pscan.py + raise rather than assert
    * Update import_utils.py
    * fix whitespaces and unused import
    * trailing whitespace + import block unformatted
    * Update modeling_mamba.py
    * transpose before pscan
    * shape comment
    * ran make style
    * use_mambapy=False by default
      Co-authored-by: Arthur <[email protected]>
    * ran make fix-copies
    ---------
    Co-authored-by: Arthur <[email protected]>

commit 9ced33ca7f909d9ace743dac083daba99c904d46
Author: Merve Noyan <[email protected]>
Date:   Tue Jul 23 13:23:23 2024 +0300

    Fix video batching to videollava (#32139)

    ---------
    Co-authored-by: Merve Noyan <[email protected]>

commit a5b226ce9811aa6b31af0bc9c09c54493a4e67c1
Author: Cyril Vallez <[email protected]>
Date:   Tue Jul 23 12:21:23 2024 +0200

    Fix flash attention speed issue (#32028)

    Add the lru_cache for speed

commit a1844a3209eb7e75582684809203bc189931a90c
Author: Ita Zaporozhets <[email protected]>
Date:   Tue Jul 23 11:45:54 2024 +0200

    gguf conversion add_prefix_space=None for llama3 (#31937)

    * gguf conversion forces add_prefix_space=False for llama3, this is not required and forces from_slow, which fails. changing to None + test
    * typo
    * clean test

commit 2e113422b3504fe6de821bb9911b24273b11aa9c
Author: Joao Gante <[email protected]>
Date:   Tue Jul 23 10:42:55 2024 +0100

    Llama: RoPE refactor (#32135)

    Co-authored-by: amyeroberts <[email protected]>
    Co-authored-by: Arthur <[email protected]>

commit 5a4a76edb7ac6bbc764392e89adc11adda91f3e5
Author: bayllama <[email protected]>
Date:   Tue Jul 23 02:28:44 2024 -0700

    Modify resize_token_embeddings to ensure output type is same as input (#31979)

    * Change resize_token_embeddings to make it return same Class that is passed to it
    * Add explanatory comment as requested in review
    * Add explanatory comments for add resizing function in lxmert
    * Add comment for padding_idx and moving _resize_bias in lxmert to LxmertForPreTraining
    ---------
    Co-authored-by: Prashanth Sateesh <[email protected]>
    Co-authored-by: Prashanth Sateesh <[email protected]>

commit 1535a2c93d325e529dc9a1907f99247fdf8a58e7
Author: Daniel Lok <[email protected]>
Date:   Tue Jul 23 17:26:00 2024 +0800

    Disable quick init for TapasPreTrainedModel (#32149)

    add attribute to model

    Signed-off-by: Daniel Lok <[email protected]>

commit 34b43211d782c00da6fef778dbfaff69bbf3f115
Author: mig-mfreitas <[email protected]>
Date:   Tue Jul 23 10:07:58 2024 +0100

    Add YaRN and Dynamic-YaRN RoPE Scaling Methods (#30910)

    * Add YaRN and Dynamic-YaRN RoPE Scaling Methods
      YaRN (Yet another RoPE extension method) combines the NTK-By-Parts Interpolation and Attention Scaling methods, improving upon existing RoPE interpolation methods for longer context window sizes.
      Fine-tuned models maintain their original performance across benchmarks while enabling efficient extrapolation and transfer learning for quicker convergence, especially in compute-limited environments.
      We implement YaRN and Dynamic-YaRN for the following list of models:
      - LLaMA
      - Falcon
      - GPT-NeoX
      - Olmo
      - Persimmon
      - Phi
      - StableLM
      - OpenLLaMA
      New unit tests are added to assert YaRN's correct behavior on both short and long sequence inputs.
      For more details, please refer to https://arxiv.org/abs/2309.00071.
      Co-authored-by: Miguel Almeida <[email protected]>
    * Refactor YaRN implementation for LLaMA
      Iterate on YaRN implementation for LLaMA and remove diff from remaining models for increased PR modularity.
      This commit includes the following changes:
      - Merge 'yarn_rope_scaling' and 'rope_scaling' dictionaries
      - Remove unnecessary attributes ('extrapolation_factor' and 'finetuned') from YaRN classes
      - Inherit 'forward' method in YaRN classes from superclass
      - Rename 'yarn' method to 'compute_yarn_scaling'
      - Extend YaRN tests with further assertions
      - Fix style inconsistencies
      Co-authored-by: Miguel Monte e Freitas <[email protected]>
    * Refactor Tensor Building Logic for YaRN
      - Comply with the the tensor building logic introduced in #30743
      - Add referencing to the optimized Attention Factor equation
      - Remove Dynamic YaRN for a more agile deployment
      Co-authored-by: mig-mfreitas <[email protected]>
    * remove unwanted file
    ---------
    Co-authored-by: Miguel Almeida <[email protected]>
    Co-authored-by: mig-mfreitas <[email protected]>
    Co-authored-by: Joao Gante <[email protected]>

commit 7405c1c77e4637768ea0ad5d27d8a4d8d67bfb19
Author: KonradSzafer <[email protected]>
Date:   Tue Jul 23 10:56:21 2024 +0200

    Add method to retrieve used chat template (#32032)

    encapsulate chat template logic

commit 605f3245dcca34381c35520c35ba0b701ed80d58
Author: Anton Vlasjuk <[email protected]>
Date:   Tue Jul 23 10:11:12 2024 +0200

    Fix mask creations of `GPTNeoX` and `GPT2` (#31944)

    * fix mask creation of gpt2 and gpt_neox caused by me
    * forgot the reshape of masks when shape > 2
    * add tests for gpt neox and gpt2
    * nit on a comment

commit 2782aadae2b0b0c313eac3ee70f84f0335577635
Author: Sanchit Gandhi <[email protected]>
Date:   Tue Jul 23 14:55:16 2024 +0800

    [modelling] remove un-necessary transpose for fa2 attention (#31749)

    * [whisper] remove un-necessary transpose for fa2 attention
    * propagate

commit f83c6f1d02fba5e5ced9357b9c9196c76d937af3
Author: Sanchit Gandhi <[email protected]>
Date:   Tue Jul 23 14:54:38 2024 +0800

    Remove `trust_remote_code` when loading Libri Dummy (#31748)

    * [whisper integration] use parquet dataset for testing
    * propagate to others
    * more propagation
    * last one

commit 3aefb4ec7f957f9561a410eabc6f9d57b2f0384f
Author: Raushan Turganbay <[email protected]>
Date:   Tue Jul 23 10:23:55 2024 +0500

    LLaVaNeXT: pad on right if training (#32134)

    * pad on right if training
    * docs
    * add tests

commit 251a2409c694c29ee28e66c954670c483cf54961
Author: James Thewlis <[email protected]>
Date:   Tue Jul 23 01:12:16 2024 -0400

    Add llama3-llava-next-8b to llava_next conversion script (#31395)

    * Add llama3-llava-next-8b to llava_next conversion script
      Adds support for the lmms-lab/llama3-llava-next-8b model to the convert_llava_next_weights_to_hf.py script, along with an example prompt generated from the llava_llama_3 conv_template in the LLaVA-NeXT repo.
    * Exclude <|begin_of_text|> from prompt example
      This token gets added automatically, so it should not be included in the prompt example.
    * Add llava-next-72b and llava-next-110b
      Adds the Qwen-based LLaVA-Next models to the conversion script, along with changes to load the models on multiple GPUs for inference.
    * Add llama3 and qwen prompt formats to docs
    * Chat prompt and padding side left for llama3 batched
    * update
    * Update src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py
      Co-authored-by: amyeroberts <[email protected]>
    * Update src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py
      Co-authored-by: amyeroberts <[email protected]>
    * remove code
    * better naming
    ---------
    Co-authored-by: raushan <[email protected]>
    Co-authored-by: Raushan Turganbay <[email protected]>
    Co-authored-by: amyeroberts <[email protected]>

commit 96a074fa7e2c04b904f72d9e827398d4c5f90f25
Author: Marc Sun <[email protected]>
Date:   Mon Jul 22 20:21:59 2024 +0200

    Add new quant method (#32047)

    * Add new quant method
    * update
    * fix multi-device
    * add test
    * add offload
    * style
    * style
    * add simple example
    * initial doc
    * docstring
    * style again
    * works ?
    * better docs
    * switch to non persistant
    * remove print
    * fix init
    * code review

commit bd9dca3b855b5a20ea11097b89c40f34d775f1c7
Author: Arthur <[email protected]>
Date:   Mon Jul 22 19:42:47 2024 +0200

    set warning level to info for special tokens have been added (#32138)

    fixes #7002

commit 817a676bd711f9626e13578068b36ef09cf572dc
Author: amyeroberts <[email protected]>
Date:   Mon Jul 22 18:29:50 2024 +0100

    Don't default to other weights file when use_safetensors=True (#31874)

    * Don't default to other weights file when use_safetensors=True
    * Add tests
    * Update tests/utils/test_modeling_utils.py
    * Add clarifying comments to tests
    * Update tests/utils/test_modeling_utils.py
    * Update tests/utils/test_modeling_utils.py

commit 74d0eb3fedf353bd670aa85ae8fcf4c85f287b5b
Author: Yoni Gottesman <[email protected]>
Date:   Mon Jul 22 20:24:43 2024 +0300

    Return assistant generated tokens mask in apply_chat_template (#30650)

    return assistant generated tokens mask in apply_chat_template

commit 7987710696803c74ce1b5e7f9dfa055096a6c00e
Author: Bertrand Thia <[email protected]>
Date:   Mon Jul 22 13:08:27 2024 -0400

    [RoBERTa] Minor clarifications to model doc (#31949)

    * minor edits and clarifications
    * address comment
      Co-authored-by: Steven Liu <[email protected]>
    ---------
    Co-authored-by: Steven Liu <[email protected]>

commit 12b6880c81db7742a29ea425dcb9e63b7dbdc449
Author: Sai-Suraj-27 <[email protected]>
Date:   Mon Jul 22 22:16:17 2024 +0530

    fix: Fixed raising `TypeError` instead of `ValueError` for invalid type (#32111)

    * Raised TypeError instead of ValueError for invalid types.
    * Updated formatting using ruff.
    * Retrieved few changes.
    * Retrieved few changes.
    * Updated tests accordingly.

commit d1ec36b94f5ba45fb2423e74074cfedab48cfe73
Author: Woojun Jung <[email protected]>
Date:   Tue Jul 23 00:27:13 2024 +0900

    Update `ko/_toctree.yml` and remove `custom_tools.md` to reflect latest changes (#31969)

    update `ko/_toctree.yml` and remove `custom_tools.md`

commit 7ba028fccb82cbee792b67d596120da8ae9397c9
Author: Matt <[email protected]>
Date:   Mon Jul 22 16:07:29 2024 +0100

    Fix failing test with race condition (#32140)

    * Fix failing test with race condition
    * make fixup
    * monotonic_ns instead of randint
    * uuid4 instead of monotonic_ns
    * Add a finally cleanup step

commit 5a649ff3ecd70599dd0fea7ee430ba47b51a4556
Author: Sanchit Gandhi <[email protected]>
Date:   Mon Jul 22 21:18:48 2024 +0800

    [generate] fix eos/pad id check on mps devices (#31695)

    Co-authored-by: Joao Gante <[email protected]>

commit f2a1e3ca684df624016285266a0ae519e4483be7
Author: Lucain <[email protected]>
Date:   Mon Jul 22 15:14:47 2024 +0200

    Mention model_info.id instead of model_info.modelId (#32106)

commit 0fcfc5ccc968ff5a1a439db04a94f566a0bd1d89
Author: Sai-Suraj-27 <[email protected]>
Date:   Mon Jul 22 18:43:39 2024 +0530

    fix: Replaced deprecated `mktemp()` function (#32123)

    Replaced deprecated mktemp function.

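Commit 0fcfc5c (#32123) above replaces the deprecated `tempfile.mktemp()`. The Python documentation deprecates it because it only returns a file name, so another process can create that file before you do; `tempfile.mkstemp()` creates and opens the file itself. The commit message does not say which replacement was used, so this is a generic sketch with a hypothetical helper name:

```python
import os
import tempfile


def round_trip_through_temp_file(text):
    """Hypothetical helper: write `text` to a fresh temporary file, read it back,
    then delete the file. Returns (contents, path)."""
    # Unlike mktemp(), which only returns a name, mkstemp() creates the file
    # and hands back an open OS-level file descriptor plus the file's path.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:  # wraps the descriptor and closes it on exit
            f.write(text)
        with open(path) as f:
            return f.read(), path
    finally:
        os.remove(path)  # mkstemp() never deletes the file automatically


contents, path = round_trip_through_temp_file("hello")
print(contents)              # hello
print(os.path.exists(path))  # False
```

When the file only needs to live inside a `with` block, `tempfile.NamedTemporaryFile` or `tempfile.TemporaryDirectory` handle the cleanup step automatically.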
commit c38c55f4fbc0163cc02ef4588fe2ec391171a2f0
Author: Joao Gante <[email protected]>
Date:   Mon Jul 22 14:06:49 2024 +0100

    Generate: store special token tensors under a unique variable name (#31980)

    * rename stuff
    * english; this one shouldn't be changed
    * add a _ to the new var names
    * musicgen
    * derp

commit aa8f86a421e23fe41b6333efc11ea4248e098d83
Author: Brian <[email protected]>
Date:   Mon Jul 22 08:06:22 2024 -0400

    Fix shard order (#32023)

commit b3818805978b411713725a1b7470dc1bda073c29
Author: Aymeric Roucher <[email protected]>
Date:   Mon Jul 22 10:49:57 2024 +0200

    Agents planning (#31702)

    * Allow planning for agents

commit 0fdea8607d7e01eb0e38a1ebeb7feee30a22f0cf
Author: Lucain <[email protected]>
Date:   Fri Jul 19 20:32:39 2024 +0200

    Fix tests after `huggingface_hub` 0.24 (#32054)

    * adapt tests
    * style
    * comment

commit fe008d6ebea1f5770b740991daeefd9322fa434a
Author: Raushan Turganbay <[email protected]>
Date:   Fri Jul 19 19:21:45 2024 +0500

    Chameleon: not supported with fast load (#32091)

    fixes

commit 62aa270f2ab3acca2a58cde8f08400ec49330b03
Author: Zach Mueller <[email protected]>
Date:   Fri Jul 19 08:58:53 2024 -0400

    Disable quick init for deepspeed (#32066)

    Disable via deepspeed

commit 89575b567e061fd87bdd655ba188b6c7a922d54a
Author: Kamil Akesbi <[email protected]>
Date:   Fri Jul 19 13:42:22 2024 +0100

    Support generating with fallback for short form audio in Whisper (#30984)

    * remove is_shortform
    * adapt _retrieve_max_frames_and_seek for short_form
    * return bos token in short and long form
    * add decoder_input_ids to short form audios
    * add eos token for short form
    * handle short form token_timestamps
    * no need to return scores
    * add is_shortform conditions
    * handle when max_new_tokens is None - short form
    * handle assistant decoding
    * fix
    * handle return_dict_in_generate
    * handle split_by_batch for encoder_attentions attribute
    * handle num_beams>1
    * handle num_return_sequences>1 in generate_with_fallback
    * handle num_return_sequences>1 with return_dict_in_generate=True
    * raise error if max_new_tokens + decoder_inputs_ids > max_target_pos
    * fix
    * apply review suggestions
    * fix
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * fix
    * logits for both short form and long form
    * handle if logits_processor is None
    * test
    * apply review changes to num_return_sequences
    * add _expand_variables_for_generation
    * remove short form commented section
    * update comments
    * uncomment num_beams line in generate_with_fallback
    * update assistant decoding
    * handle return_segment with short form generation
    * up
    * fix output format is_shortform
    * overwrite beam_sample test
    * update _set_return_timestamps
    * apply review suggestions
    * apply review suggestions
    * remove seek_outputs_short_form
    * fix _stack_split_outputs
    * fix stack dim in _stack_split_outputs
    * update tests
    * fix past_key_values + beam tests
    * fix
    * clean _expand_variables_for_generation
    * make style
    * fix slow tests
    * make style
    * max_length condition
    * make style
    * add slow tests for shortform fallback
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * apply review changes
    * Update src/transformers/models/whisper/generation_whisper.py
      Co-authored-by: Sanchit Gandhi <[email protected]>
    * up
    * fix slow tests
    * apply review suggestions
    * update test
    * make style
    * small fix
    * fix
    * fix test_new_cache_format
    * fix past_key_values
    * fix
    * make style
    * fix slow tests
    * fix
    ---------
    Co-authored-by: Sanchit Gandhi <[email protected]>

commit 46835ec6aed62e9a73784f1b6a43030afd601e5e
Author: Merve Noyan <[email protected]>
Date:   Fri Jul 19 15:40:40 2024 +0300

    Add image-text-to-text task guide (#31777)

    * Add image-text-to-text task page
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: Steven Liu <[email protected]>
    * Address comments
    * Fix heading
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Update docs/source/en/tasks/image_text_to_text.md
      Co-authored-by: amyeroberts <[email protected]>
    * Address comments
    * Update image_text_to_text.md
    ---------
    Co-authored-by: Steven Liu <[email protected]>
    Co-authored-by: amyeroberts <[email
protected]> commit 4bd8f12972c6ad06e264baa39f17ec9dfa9a5cb2 Author: Merve Noyan <[email protected]> Date: Fri Jul 19 14:50:34 2024 +0300 Fixes to chameleon docs (#32078) * Fixes * Let's not use auto commit 566b0f1fbf5feb53a18591ca215a8d1245a790ef Author: Keith Stevens <[email protected]> Date: Fri Jul 19 03:56:45 2024 -0700 Fix progress callback deepcopy (#32070) * Replacing ProgressCallbacks deepcopy with a shallowcopy * Using items instead of entries * code cleanup for copy in trainer callback * Style fix for ProgressCallback commit e316c5214fe51de0bf8e824245bfd6225c9925aa Author: Raushan Turganbay <[email protected]> Date: Fri Jul 19 15:38:01 2024 +0500 VideoLLaVa: fix chat format in docs (#32083) fix chat format commit 22f888b3fab3d914882b8f44896a5658712f535c Author: Joshua Lochner <[email protected]> Date: Fri Jul 19 11:19:35 2024 +0200 [mistral] Fix FA2 attention reshape for Mistral Nemo (#32065) * [mistral] Fix FA2 attention reshape * [run-slow] mistral commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8 Author: Kamil Akesbi <[email protected]> Date: Fri Jul 19 09:26:38 2024 +0100 Incorrect Whisper long-form decoding timestamps (#32003) * fix lo form timestamps in decode_batch * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: Yoach Lacombe <[email protected]> * add test * make style * fix copies * Update src/transformers/models/whisper/tokenization_whisper_fast.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/whisper/processing_whisper.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: amyeroberts <[email protected]> * apply review suggestions * fix * fix copies * fix * Update 
src/transformers/models/whisper/tokenization_whisper_fast.py Co-authored-by: amyeroberts <[email protected]> * fix-copies --------- Co-authored-by: Yoach Lacombe <[email protected]> Co-authored-by: amyeroberts <[email protected]> commit 56a7745704261919dd8117e3a8aa4fb43fade30e Author: NielsRogge <[email protected]> Date: Fri Jul 19 10:20:03 2024 +0200 [Chameleon, Hiera] Improve docs (#32038) * Improve docs * Fix docs * Fix code snippet commit b873234cb649a24865021f0d598627ce2b24d34a Author: Raushan Turganbay <[email protected]> Date: Fri Jul 19 10:08:56 2024 +0500 Llava: add default chat templates (#31691) * add default chat templates * Update src/transformers/models/llava/processing_llava.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/llava_next/processing_llava_next.py Co-authored-by: amyeroberts <[email protected]> * more clear docstring and docs * Update docs/source/en/model_doc/llava.md Co-authored-by: NielsRogge <[email protected]> * Update docs/source/en/model_doc/llava_next.md Co-authored-by: NielsRogge <[email protected]> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: NielsRogge <[email protected]> * add tests * remove default templates (see #31733) * load chat template from another file * Update docs/source/en/model_doc/llava_next.md Co-authored-by: amyeroberts <[email protected]> * revert some changes in docs * forgot vipllava * chat template file is not temporary hack * warn if loading from processor * not that file * similarly modify `save_pretrained` * Update tests/models/llava_next/test_processor_llava_next.py Co-authored-by: amyeroberts <[email protected]> * Update tests/models/vipllava/test_processor_vipllava.py Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/processing_utils.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/processing_utils.py 
Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/llava.md Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/llava.md Co-authored-by: amyeroberts <22614925+amyeroberts@use…

* Added mamba.py backend (#30139)
  * Update README.md
  * tests: forward ok
  * backward test done
  * done testing
  * removed check. scripts
  * Update README.md
  * added use_mambapy arg
  * fixed typo in warning
  * protected imports w/ mambapy package
  * delete pscan.py + raise rather than assert
  * Update import_utils.py
  * fix whitespaces and unused import
  * trailing whitespace + import block unformatted
  * Update modeling_mamba.py
  * transpose before pscan
  * shape comment
  * ran make style
  * use_mambapy=False by default Co-authored-by: Arthur
  * ran make fix-copies
  ---------
  Co-authored-by: Arthur
* Rename Phi-3 rope scaling type (#31436)
  * renamed phi3 rope_scaling type
  * fixed trailing whitespaces
  * fixed test
  * added warning
  * fixed format
* Revert "Incorrect Whisper long-form decoding timestamps " (#32148)
  Revert "Incorrect Whisper long-form decoding timestamps (#32003)"
  This reverts commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8.
* Fix typing to be compatible with later py versions (#32155)
* feat(cache): StaticCache uses index_copy_ to avoid useless copy (#31857)
  * feat(cache): StaticCache uses index_copy_ to avoid useless copy
    Using index_copy_ allows for explicit in-place change of the tensor. Some backends (XLA) will otherwise copy the tensor, making the code slower and using more memory. Proposed implementation will end up using less memory and on XLA will result in less compilation, but the change is also quite generic, making no change whatsoever on CUDA or CPU backend.
  * feat(cache): SlidingWindowCache uses index_copy_ to avoid useless copy
    Applying the same change done in StaticCache.
  * fix(cache): fallback of index_copy_ when not implemented
  * fix(cache): in index_copy_ ensure tensors are on same device
  * [run slow] llama
  * fix(cache): add move of cache_position to same device in SlidingWindowCache
  * Revert "[run slow] llama"
    This reverts commit 02608dd14253ccd464e31c108e0cd94364f0e8b9.
* Added additional kwarg for successful running of optuna hyperparameter search (#31924)
  Update integration_utils.py Added additional kwarg
* Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629)
  * add DataCollatorBatchFlattening
  * Update data_collator.py
  * change name
  * new FA2 flow if position_ids is provided
  * add comments
  * minor fix
  * minor fix data collator
  * add test cases for models
  * add test case for data collator
  * remove extra code
  * formating for ruff check and check_repo.py
  * ruff format ruff format tests src utils
  * custom_init_isort.py
* Updated `ruff` to the latest version (#31926)
  * Updated ruff version and fixed the required code accorindg to the latest version.
  * Updated ruff version and fixed the required code accorindg to the latest version.
  * Added noqa directive to ignore 1 error shown by ruff
* Dev version: v4.44.0.dev0
* Llama 3.1 conversion
  Co-authored-by: Arthur Zucker
* fix (#32162)
* fix: Fixed an if condition that is always evaluating to true (#32160)
  Fixed an if condition always evaluating to true.
* [docs] change temperature to a positive value (#32077)
  fix
* adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer (#32171)
  * adds: extra_repr() to MambaRMSNorm to include the hidden size of the layer
  * style fix with ruff:
* fix: default value reflects the runtime environment variables rather than the ones present at import time. (#32153)
  * fix: default value reflects the runtime environment variables rather than the ones present at import time.
  * Fix: Change `deterministic` to None by default; use env var if None
* Update qwen2.md (#32108)
  * Update qwen2.md outdated description
  * Update qwen2.md amended
  * Update qwen2.md Update
  * Update qwen2.md fix wrong version code, now good to go
* Remove conversational pipeline tests (#32099)
  Remove conversation pipeline tests
* RoPE: relaxed rope validation (#32182)
  * relaxed rope check
  * lets also accept rope_type=None, defaulting to the original implementation
  * type and rope_type can coexist
* let's not warn when someone is running a forward (#32176)
  * let's not warn when someone is running a foward without cache + self.training
  * more models
  * fixup
* Fix resize embedding with Deepspeed (#32192)
  fix resize when deepspeed
* Fix float8_e4m3fn in modeling_utils (#32193)
  * Fix float8_e4m3fn in modeling_utils
  * style
  * fix
  * comment
* Support dequantizing GGUF FP16 format (#31783)
  * support gguf fp16
  * support gguf bf16 with pytorch
  * add gguf f16 test
  * remove bf16
* :rotating_light: No more default chat templates (#31733)
  * No more default chat templates
  * Add the template to the GPT-SW3 tests since it's not available by default now
  * Fix GPT2 test
  * Fix Bloom test
  * Fix Bloom test
  * Remove default templates again
* fix: Replaced deprecated `unittest method` with the correct one (#32198)
  Replaced deprecated unittest method with the correct one.
* [whisper] fix short-form output type (#32178)
  * [whisper] fix short-form output type
  * add test
  * make style
  * update long-form tests
  * fixes
  * last fix
  * finalise test
* remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 (#32210)
  remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0
* Update question_answering.py (#32208)
* [BigBird Pegasus] set _supports_param_buffer_assignment to False (#32222)
  set _supports_param_buffer_assignment to False
* [warnings] fix E721 warnings (#32223)
  fix E721 warnings
* Follow up for #31973 (#32025)
  * fix
  * [test_all] trigger full CI
  ---------
  Co-authored-by: ydshieh
* translate philosophy.md to chinese (#32177)
  * translate philosophy.md to chinese
  * add the missing link
* Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. Default to using the currently active microphone on Mac (#31846)
  * use currently active microphone on mac for ffmpeg_microphone
  * Allow ffmpeg_microphone device to be specified Co-authored-by: amyeroberts
  ---------
  Co-authored-by: amyeroberts
* Fix code snippet for Grounding DINO (#32229)
  Fix code snippet for grounding-dino
* Generation: stop at `eos` for assisted decoding (#31301)
  * fix
  * move changes to prompt lookup
  * add test
  * set eos in assistant model
  * style
  * fix flakiness
  * changes for new `main`
  * Update tests/generation/test_utils.py Co-authored-by: amyeroberts
  * Update tests/generation/test_utils.py Co-authored-by: amyeroberts
  * add comment to explain
  ---------
  Co-authored-by: amyeroberts
* Llava: generate without images (#32183)
  * llava w/o images
  * tests
* Resize embeds with DeepSpeed (#32214)
  * fix resize when deepspeed
  * deepsped uses new embeds
  * we needed this
* don't log base model architecture in wandb if log model is false (#32143)
  * don't log base model architecture in wandb is log model is false
  * Update src/transformers/integrations/integration_utils.py Co-authored-by: amyeroberts
  * convert log model setting into an enum
  * fix formatting
  ---------
  Co-authored-by: amyeroberts
* Refactor: Removed un-necessary `object` base class (#32230)
  * Refactored to remove un-necessary object base class.
  * small fix.
* Adds: extra_repr for RMSNorm layers in most models (#32204)
  * adds: extra_repr() to RMSNorm layers in multiple models
  * adds: extra_repr for deprecated models as well
  * formatting as per style guide
* Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 (#31934)
  * Add check for target_sizes is None in post_process_image_guided_detection
  * Make sure Owlvit and Owlv2 in sync
  * Fix incorrect indentation; add check for correct size of target_sizes
* [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` (#32039)
  * add flash attention check
  * fix
  * fix
* Flash-Attn: fix generation when no attention mask or no pading (#32241)
  * fix
  * fix prev test (half of failures)
  * [run-slow] llama, gemma2
  * [run-slow] llama, gemma2
* More flexible trigger condition (#32251)
  update
  Co-authored-by: ydshieh
* Llama 3.1: replace for loop by tensor ops at inv_freq initialization (#32244)
  * replace for loop by tensor ops
  * rm assert; readability
* 🚨 Bloom support for cache class (#31445)
  * bloom dynamic cache
  * bloom follows standard cache format
  * no skips for bloom anymore
  * use cache position when possible
  * clean up
  * codestyle
  * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts
  * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts
  * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts
  * pr comments
  * isinstance fix
  * address comments
  * make musicgen test happy
  * [run-slow] bloom
  ---------
  Co-authored-by: amyeroberts
* Upload new model failure report to Hub (#32264)
  upload
  Co-authored-by: ydshieh
* Optimize t5 tokenize logic to avoid redundant calls (#32270)
  * Optimize t5 tokenize logic to avoid redundant calls
  * fix and overwrite copies
* fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call (#32262)
  Removed one wrong argument passed to convert_blip_checkpoint function call.
* Repo: remove exceptions in `check_docstrings` (#32259)
  remove exceptions
* make `p_mask` a numpy array before passing to `select_starts_ends` (#32076)
  * fix
  * bug fix
  * refine
  * fix
* fix(docs): Fixed a link in docs (#32274)
  Fixed a link in docs.
* Generate: end-to-end compilation (#30788)
  * mvp
  * added test (a few models need fixes)
  * fix a few test cases
  * test nits
  * harder test 😈
  * revert changes in stablelm
  * test with improved condition
  * add todo
  * tmp commit
  * merged with main
  * nits
  * add todo
  * final corrections
  * add docs for generation compilation
  * docs nits
  * add tip
  * PR suggestions
  * add more details to the compilation docs
  * fix cache positions
  * cache is now init in generate; update docs
  * tag test as flaky
  * docs
  * post rebase make fixup and other nits
  * remove unintended changes
  * whisper (encoder-decoder) not supported
  * move token default updates to ; add tests for token defaults
  * push changes
  * manual rebase
  * chameleon doesn't support this
  * fix test_static_cache_mha_mqa_gqa (broken in another PR)
  * docs: dynamic is better with end-to-end compilation
* Whisper tokenizer word level timestamps (#32197)
  * fix _fix_key in PreTrainedModel
  * fix _find_longest_common_sequence
  * add test
  * remove result.json
  * nit
  * update test
* [pipeline] fix padding for 1-d tensors (#31776)
  * [pipeline] fix padding for 1-d tensors
  * add test
  * make style
  * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py Co-authored-by: Kamil Akesbi
  * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py
  ---------
  Co-authored-by: Kamil Akesbi
* Make static cache compatible with torch.export (#32168)
* Add stream messages from agent run for gradio chatbot (#32142)
  * Add stream_to_gradio method for running agent in gradio demo
* use torch 2.4 in 2 CI jobs (#32302)
  Co-authored-by: ydshieh
* Docs: fix GaLore optimizer code example (#32249)
  Docs: fix GaLore optimizer example
  Fix incorrect usage of GaLore optimizer in Transformers trainer code example. The GaLore optimizer uses low-rank gradient updates to reduce memory usage. GaLore is quite popular and is implemented by the authors in [https://github.com/jiaweizzhao/GaLore](https://github.com/jiaweizzhao/GaLore). A few months ago GaLore was added to the HuggingFace Transformers library in https://github.com/huggingface/transformers/pull/29588. Documentation of the Trainer module includes a few code examples of how to use GaLore. However, the `optim_targe_modules` argument to the `TrainingArguments` function is incorrect, as discussed in https://github.com/huggingface/transformers/pull/29588#issuecomment-2006289512. This pull request fixes this issue.
* Fix GGUF dequantize for `gguf==0.9.1` (#32298)
  * fix gguf dequantize for gguf==0.9.1
  * fix old version
  * make style
* Cast epochs_trained to int when resuming training (#32286)
  * fix epochs_trained as int when resuming training
  * refactor
  ---------
  Co-authored-by: teddyferdinan
* feat(ci): set `fetch-depth: 0` in trufflehog checkout step (#31663)
* Fix M4T for ASR pipeline (#32296)
  * tentative fix
  * do the same for M4T
* Docs: formatting nits (#32247)
  * doc formatting nits
  * ignore non-autodocs
  * Apply suggestions from code review Co-authored-by: amyeroberts
  * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts
  * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts
  * make fixup
  ---------
  Co-authored-by: amyeroberts
* Alternative agent plan (#32295)
  * new agent plan
  * plan type assertion
  * style corrections
  * better prompt naming
  * make fixup
* fix: Added missing raise keyword for few exceptions (#32333)
  Fixed raising of few exceptions.
* fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276)
* fixes #32329 : The Torch code is correct - to get an average of 10% o… (#32335)
  fixes #32329 : The Torch code is correct - to get an average of 10% of the total, we want to take 50% of the remainder after we've already masked 80% with [MASK] in the previous step.
* Repo checks: skip docstring checks if not in the diff (#32328)
  * tmp
  * skip files not in the diff
  * use git.Repo instead of an external subprocess
  * add tiny change to confirm that the diff is working on pushed changes
  * add make quality task
  * more profesh main commit reference
* Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process (#32191)
  * Remove user-defined tokens which can be obtained through merges
  * Remove debug line
  * formatting
  * Refactor spm slow -> fast converter
  * revert unnecessary refactor
  * set comprehension
  * remove test files
  * Use `vocab_scores`
  * Always replace spiece underline with space in decode
  * we no longer need token filtering
  * Add save fast load slow unit test
  * Remove tokenizers version check
  * Remove duplicate code
  * Make `<start_of_turn>` and `<end_of_turn>` special tokens
  * Bias merge priority with length if score is the same
  * Add unit test for merge priority
  * CI
* LLaVA-NeXT: fix anyres shapes (#32314)
  fix
* Gemma2 and flash-attention (#32188)
  * enable flash-attn & static cache
  * this works, not the prev
  * fix for sliding window layers
  * not needed anymore
* Llama 3.1: Fix incorrect `inv_freq` assignment (#32330)
  fix 💩
* [Idefics2] - Fix FA2 call for Perceiver layer (#32275)
  * Fix FA2 call for Perciever layer
  * [run_slow] idefics2
  * [run_slow] idefics2
  * [run_slow] idefics2
  * Fix up
  * [run_slow] idefics2
  * [run_slow] idefics2
  * [run_slow] idefics2
* Gemma 2: support assisted generation (#32357)
* Fix error when streaming to gradio with non-string tool arguments (#32360)
  Fix error when streaming agent run to gradio with non-string tool arguments
* >3-5x faster torch.compile forward compilation for autoregressive decoder models (#32227)
  * draft
  * apply changes to all relevant archs
  * rerun ci - check_docstrings.py failing?
  * fix docstring
  * move 2D->4D mask creation to modeling file
  * repo consistency
  * fix the batch size = 1 case - calling contiguous is not enough
  * nit
  * style
  * propagate to gemma/gemma-2
  * prepare inputs for gemma generation
  * implement test and tiny fix in gemma2
  * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: Arthur
  * fix copies
  * ci pass
  * fix gemma's test_compile_static_cache tests
  * flacky
  * retrigger ci
  ---------
  Co-authored-by: sanchit-gandhi
  Co-authored-by: Arthur
* fix: Removed unnecessary `@staticmethod` decorator (#32361)
  * Fixed staticmethods with self as first argument.
  * Fixed staticmethods with self as first argument.
  * Fixed staticmethods with self as first argument.
  * Fixed staticmethods with self as first argument.
* fix: warmup_steps check for training_args (#32236)
* LLaVa: add cache class attribute (#32278)
  cache class flag
* [enc-dec cache] fix bug in indexing (#32370)
* [whisper] compile compatibility with long-form decoding (#31772)
  * [whisper] compile compatibility with long-form decoding
  * clarify comment
  * fix after rebase
  * finalise
  * fix bsz
  * fix cache split
  * remove contiguous
  * style
  * finish
  * update doc
  * prevent cuda graph trace
* Remove size check between attn_weights and kv_seq_len for phi3 (#32339)
  * Remove size check between attn_weights and kv_seq_len
  * add unit tests
* add missing attribute _supports_param_buffer_assignment for gpt-j. (#32359)
  Co-authored-by: Guoming Zhang
* Check device map for saving tokenizer config on TPU (fix for issue #31971) (#32043)
  * Remove TPU device map for saving tokenizer config
  * Update tokenization_utils_base.py
  * Fix error msg when passing non-string device into tokenizer
  * Fix error message for non-string tokenizer device
  * Print out tokenizer device type in error msg
  * Update tokenization_utils_base.py
* update clean_up_tokenization_spaces warning (#32371)
* Empty list in defaults for LLaMA special tokens during weights conversion (#32342)
  empty list in defaults
* Fix conflicting key in init kwargs in PreTrainedTokenizerBase (#31233)
  * Fix conflicting key in init kwargs in PreTrainedTokenizerBase
  * Update code to check for callable key in save_pretrained
  * Apply PR suggestions
  * Invoke CI
  * Updates based on PR suggestion
* Offloaded KV Cache (#31325)
  * Initial implementation of OffloadedCache
  * enable usage via cache_implementation
  * Address feedback, add tests, remove legacy methods.
  * Remove flash-attn, discover synchronization bugs, fix bugs
  * Prevent usage in CPU only mode
  * Add a section about offloaded KV cache to the docs
  * Fix typos in docs
  * Clarifications and better explanation of streams
* Docker: add `speech` dep to the consistency docker image (#32374)
* Fixed Hybrid Cache Shape Initialization. (#32163)
  * fixed hybrid cache init, added test
  * Fix Test Typo
  ---------
  Co-authored-by: Aaron Haag
* Yell at the user if zero-3 init wasn't performed, but expected to have been done (#32299)
  * Test this zach
  * Test for improper init w/o zero3
  * Move back
  * Apply suggestions from code review Co-authored-by: amyeroberts
  * Get rid of stars in warning
  * Make private
  * Make clear
  ---------
  Co-authored-by: amyeroberts
* Update docs (#32368)
  nits
* RoPE: Add numerical tests ✨ (#32380)
  tests! :D
* [generate] only require an attention mask for mps with torch<2.4 (#32367)
  * up
  * style
  * stopping
* fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. (#32157)
  fix: Exception raised when running .
* MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. (#31500)
  * Mixtral: remove unnecessary plus 1 when calculating rotary_seq_len, allowing position_ids=None (no auto position_ids generation could be unsafe)
  * fix typo [:-1] to [:, -1]
  * to meet formatting requirement
  * to meet formatting requirement
  * remove white space
  * MixtralFlashAttention2: put "+ 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. Fix format/style issue.
  * propagate to startcoder2, phi3, mixtral and qwen2
  * update qwen2_moe
* Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer (#32393)
  Bump keras in /examples/research_projects/decision_transformer
  Bumps [keras](https://github.com/keras-team/keras) from 2.8.0 to 2.13.1.
  - [Release notes](https://github.com/keras-team/keras/releases)
  - [Commits](https://github.com/keras-team/keras/compare/v2.8.0...v2.13.1)
  ---
  updated-dependencies:
  - dependency-name: keras
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix: SeamlessM4TFeatureExtractor stride remainder (#32088)
  * fix: SeamlessM4TFeatureExtractor stride remainder
  * Added attention mask size test
  * Reran ruff for style correction
* Phi3 tests: fix typing for Python 3.8 (#32388)
  fix phi
* #32184 save total_vocab_size (#32240)
  * save total_vocab_size = vocab_size + user added tokens to speed up operation
  * updating length when added_tokens_decoder is set
  * add test len(tokenizer)
* add values for neftune (#32399)
  I always forget what typical values are, and I have to look at the paper everytime. This will be a helpful reminder.
* Fix documentation references to google/bit-50 model (#32407)
* Persist embedding type of BART and mBART models after resize (#32242)
  * fix: persist embedding type of MBartConditonalGeneration after resize
  * fix: persist embedding type of BartConditonalGeneration after resize
* fix: Updated `test_embeded_special_tokens` for luke and mluke models (#32413)
  Fixed tokenizertests for luke, mluke models.
* Respect the config's attn_implementation if set (#32383)
  * Respect the config's attn if set
  * Update test - can override in from_config
  * Fix
* Fix documentation links and code reference to model llava-next (#32434)
* Cache: create docs (#32150)
  * draft
  * updates
  * works?
  * try adding python example in hidden section
  * another try
  * hwo do i render python
  * format as html code?
  * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante
  * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante
  * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante
  * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante
  * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante
  * one more small update
  * should render hidden secrtion now
  * add outputs
  * fix links
  * check links
  * update all links
  * update with offloaded cache
  * all cache is importable, so they appear in docs
  * fix copies
  * docstring...
  ---------
  Co-authored-by: Joao Gante
* Llava: fix checkpoint_doc (#32458)
  fix: add new llava like model bug
* add the missing flash attention test marker (#32419)
  * add flash attention check
  * fix
  * fix
  * add the missing marker
  * bug fix
  * add one more
  * remove order
  * add one more
* Update kwargs validation for `preprocess` with decorator (#32024)
  * BLIP preprocess
  * BIT preprocess
  * BRIDGETOWER preprocess
  * CHAMELEON preprocess
  * CHINESE_CLIP preprocess
  * CONVNEXT preprocess
  * DEIT preprocess
  * DONUT preprocess
  * DPT preprocess
  * FLAVA preprocess
  * EFFICIENTNET preprocess
  * FUYU preprocess
  * GLPN preprocess
  * IMAGEGPT preprocess
  * INTRUCTBLIPVIDEO preprocess
  * VIVIT preprocess
  * ZOEDEPTH preprocess
  * VITMATTE preprocess
  * VIT preprocess
  * VILT preprocess
  * VIDEOMAE preprocess
  * VIDEOLLAVA
  * TVP processing
  * TVP fixup
  * SWIN2SR preprocess
  * SIGLIP preprocess
  * SAM preprocess
  * RT-DETR preprocess
  * PVT preprocess
  * POOLFORMER preprocess
  * PERCEIVER preprocess
  * OWLVIT preprocess
  * OWLV2 preprocess
  * NOUGAT preprocess
  * MOBILEVIT preprocess
  * MOBILENETV2 preprocess
  * MOBILENETV1 preprocess
  * LEVIT preprocess
  * LAYOUTLMV2 preprocess
  * LAYOUTLMV3 preprocess
  * Add test
  * Update tests
* Fix get large model config for Switch Transformer encoder only tester (#32438)
* Dependencies: fix typo (#32389)
  deps_2
* Add Nemotron HF Support (#31699)
  * Add nemotron support
  * fix inference
  * add unit test
  * add layernorm1p as a class to avoid meta device mismatch
  * test fixed
  * Add copied_from statements
  * remove pretraining_tp args
  * remove nemotronlayernorm
  * force LN computation done in FP32
  * remove nemotrontokenizer and use llamatokenizer
  * license update
  * add option for kv_channels for minitron8b
  * remove assert
  * o_proj fixed
  * o_proj reshape
  * add gated_proj option
  * typo
  * remove todos
  * fix broken test after merging latest main
  * remove nezha/nat after meging main
  * chnage default config to 15b model
  * add nemo conversion script
  * rename conversion script
  * remove gate_proj option
  * pr comment resolved
  * fix unit test
  * rename kv_channels to head_dim
  * resolve PR issue
  * add nemotron md
  * fix broken tests
  * refactor rope for nemotron
  * test fix
  * remove linearscaling
  * whitespace and import
  * fix some copied-from
  * code style fix
  * reformatted
  * add position_embedding to nemotronattention
  * rope refactor to only use config, copied-from fix
  * format
  * Run make fix-copies
  * nemotron md with autodoc
  * doc fix
  * fix order
  * pass check_config_docstrings.py
  * fix config_attributes
  * remove all llama BC related code
  * Use PreTrainedTokenizerFast
  * ruff check examples
  * conversion script update
  * add nemotron to toctree
* Generate: fix end to end compilation (#32465)
* Add codestral mamba2 (#32080)
  * add new model like
  * draft cuda forward - mismatched keys (sharding on conv1)
  * match keys successfully
  * fix split
  * get generation/forward running (wrong gens, norm?)
  * :update
  * some refactoring
  * fixes
  * works up until copy to cache
  * fix
  * update
  * NON WORKING VERSION
  * version that work?
  * nit
  * fix config
  * fix conversion script
  * working cuda forward
  * nit
  * update
  * simplifcation
  * make mamba slow simple work
  * no einops
  * todo
  * fix style
  * no einops
  * update fix no einsum
  * nit
  * remove einops
  * bug: scan_output differs strongly
  * add rms norm option
  * fix fast + slow generation with and w/o cache :heavy_check_mark:
  * draft integration tests
  * remove a big chunk of the einsum
  * fix slow, fast generations, without any einsum
  * fix copies
  * fix structure
  * fix up modeling and tests
  * fix tests
  * clamping is indeed worse
  * recover mamba2 cache test
  * fix copies
  * no cache position (yet)
  * fix tf tests
  * fix matmul for generate
  * fixup
  * skip cache tests for now
  * [run-slow]mamba2
  * tune out hidden states for padding
  * test batched generation
  * propagate attention mask changes
  * fix past length
  * fix integration test
  * style
  * address comments
  * update readme
  * add mamba2 version check
  * fix tests
  * [run-slow]mamba2
  * skip edge tests
  * [run-slow]mamba2
  * last fixup
  * [run-slow]mamba2
  * update README
  ---------
  Co-authored-by: Arthur Zucker
* Migrate import checks not need accelerate, and be more clear on min versions (#32292)
  * Migrate import checks to secondary accelerate calls
  * better errs too
  * Revert, just keep the import checks + remove accelerate-specific things
  * Rm extra'
  * Empty commit for ci
  * Small nits
  * Final
* Documentation: BOS token_id deprecation change for NLLB (#32443)
  Update nllb.md
* dev version 4.45.0
* `is_torchdynamo_compiling` -- cast a wide exception net (#32476)
  * cast a wide net
  * make fix-copies with a few manual changes
  * add copied from
* Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477)
  * Revert "fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276)"
    This reverts commit 62c60a30181a65e1a3a7f19c3055a240a6a21335.
We uncovered an issue with this change that caused our training runs to hang. * `is_torchdynamo_compiling` -- cast a wide exception net (#32476) * cast a wide net * make fix-copies with a few manual changes * add copied from --------- Co-authored-by: Joao Gante <[email protected]> * 🌐 [i18n-KO] Translated `mask_generation.md` to Korean (#32257) * docs: ko: tasks/mask_generation.md * feat: nmt draft * fix : toc local * fix : manual edits * fix : ko-toctree * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * fix: resolve suggestions * fix: resolve suggestions * fix: resolve suggestions --------- Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> * 🌐 [i18n-KO] Translated `idefics.md` to Korean (#32258) * docs: ko: tasks/idefics.md * feat: nmt draft * fix: manual edits * fix: resolve suggestions Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> --------- Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> * 🌐 [i18n-KO] Translated `image_to_image.md` to Korean (#32327) * docs: ko: tasks/image_to_image.md * feat: nmt draft * fix: manual edits * fix: resolve suggestions Co-authored-by: Jihun Lim <[email protected]> Co-authored-by: Jiwook Han <[email protected]> * fix: handle remaining suggestions Co-authored-by: Jiwook Han <[email protected]> --------- Co-authored-by: Jihun Lim <[email protected]> Co-authored-by: Jiwook Han <[email protected]> * Cache: new Cache format in decoder-only models (#31421) * draft bart with new cache * add cache for decoder-only models * revert utils * modify docstring * revert bart * minor fixes * fix copies (not related) * revert 
tests * remove enc-dec related code * remove bloom * remove opt (enc-dec) * update docstring * git, codegen, gpt_neo, gpt_neox, gpj * clean up * copied from statements * revert * tmp * update warning msg * forgot git * add more flags * run-slow git,codegen,gpt_neo,gpt_neox,gpj * add cache flag to VLMs * remove files * style * video LLMs also need a flag * style * llava will go in another PR * style * [run-slow] codegen, falcon, git, gpt_neo, gpt_neox, gptj, idefics * Update src/transformers/models/gpt_neo/modeling_gpt_neo.py Co-authored-by: Arthur <[email protected]> * copy from * deprecate until v4.45 and warn if not training * nit * fix test * test static cache * add more tests and fix models * fix copies * return sliding window mask * run slow tests & fix + codestyle * one more falcon fix for alibi --------- Co-authored-by: Arthur <[email protected]> * Gemma2: add cache warning (#32279) * gemma2 fallback to dynamic cache * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Joao Gante <[email protected]> * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <[email protected]> * raise error and dont fallback to dynamic cache * prev will break most forward calls/tests * Update src/transformers/models/gemma2/modeling_gemma2.py Co-authored-by: Arthur <[email protected]> * update * fix copies --------- Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Arthur <[email protected]> * enable xla fsdp (#32048) * enable xla fsdp * add acceleration version check for xla fsdp * Fix typo in tokenization_utils_base.py (#32484) * Agents use grammar (#31735) * Allow optional use of grammars to constrain generation * fix broken link in docs (#32491) `https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextGenerationPipeline.__call__` `generate_kwargs (dict, optional) — Additional keyword arguments to pass along to the generate method of the model (see the generate method corresponding to 
your framework here).` link in "here" doesnt work * Docs: alert for the possibility of manipulating logits (#32467) * logits * words * 🌐 [i18n-KO] Translated `gptq.md` to Korean (#32293) * fix: manual edits * fix: manual edits2 * fix: delete files * fix: resolve suggestions Co-authored-by: Sungmin Oh <[email protected]> Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: 김준재 <[email protected]> * fix: resolve suggestions Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: Sungmin Oh <[email protected]> Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: 김준재 <[email protected]> Co-authored-by: Steven Liu <[email protected]> * 🌐 [i18n-KO] Translated `prompting.md` to Korean (#32294) * docs: ko: tasks/prompting.md * feat: nmt-draft * fix: update translation in prompting.md * fix: update toctree.yml * fix: manual edits * fix: toctree edits * fix: resolve suggestions Co-authored-by: boyunJang <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> --------- Co-authored-by: boyunJang <[email protected]> Co-authored-by: Harheem Kim <[email protected]> Co-authored-by: timdalxx <[email protected]> * 🌐 [i18n-KO] Translated `quantization/quanto.md` to Korean (#32281) * docs: ko: quantization/quanto.md * feat: nmt draft * fix: resolve suggestions Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> Co-authored-by: 김준재 <[email protected]> * fix: resolve suggestions Co-authored-by: SeungYoun Lee <[email protected]> --------- Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> Co-authored-by: 김준재 <[email protected]> * 🌐 [i18n-KO] Translated `image_feature_extraction.md` to Korean (#32239) * docs: ko: tasks/images_feature_extraction.md * feat: nmt draft * fix: manual edits * fix: manual edits * fix: manual edits * fix: manual edits * feat: manual edits * Update 
docs/source/ko/tasks/image_feature_extraction.md Co-authored-by: Jihun Lim <[email protected]> * Update docs/source/ko/tasks/image_feature_extraction.md Co-authored-by: Jihun Lim <[email protected]> * fix: manual edits --------- Co-authored-by: Jihun Lim <[email protected]> * Fix references to model google mt5 small (#32497) * Docs: Fixed WhisperModel.forward’s docstring link (#32498) Fixed WhisperModel.forward’s docstring link. * 🌐 [i18n-KO] Translated `chat_templating.md` to Korean (#32362) * docs: ko: chat_templating.md * feat: nmt draft * fix: manual edits * Update docs/source/ko/chat_templating.md Co-authored-by: Sungmin Oh <[email protected]> * Update docs/source/ko/chat_templating.md Co-authored-by: Sungmin Oh <[email protected]> * fix: apply suggestions from code review - anchor Co-authored-by: Sungmin Oh <[email protected]> * fix: manual edits Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> * fix: manual edits * fix: delete 'default template' section --------- Co-authored-by: Sungmin Oh <[email protected]> Co-authored-by: SeungYoun Lee <[email protected]> Co-authored-by: Minki Kim <[email protected]> * Fix link to autoclass_tutorial.md in i18n.md (#32501) * Fix typo: depracted -> deprecated (#32489) Hello! ## Pull Request overview * Fix typo ## Details This should speak for itself. 
cc @itazap @ArthurZucker - Tom Aarsen * Fix issue #32518: Update llm_tutorial.md (#32523) Update llm_tutorial.md remove comma re: issue 32518 https://github.com/huggingface/transformers/issues/32518 * Change Phi3 `_supports_sdpa` to True (#32457) * Change `_supports_sdpa` to True * add phi3 to sdpa support list * Uniformize kwargs for processors - GroundingDINO (#31964) * fix typo * uniform kwargs * make style * add comments * remove return_tensors * remove common_kwargs from processor since it propagates * make style * return_token_type_ids to True * revert the default imagekwargs since does not accept any value in the image processro * revert processing_utils.py * make style * add molbap's commit * fix typo * fix common processor * remain * Revert "add molbap's commit" This reverts commit a476c6ee88318ce40d73ea31e2dc2d4faa8ae410. * add unsync PR * revert * make CI happy * nit * import annotationformat * Fix add-new-model-like (#31773) * handle (processor_class, None) returned by ModelPatterns * handle (slow, fast) image processors in add model * handle old image processor case * Add Qwen2-Audio (#32137) * add qwen2audio * Update check_repo.py * fix style * fix test * fix style * add model size * Qwen2AudioEncoderModel->Qwen2AudioEncoder; add copy info * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * switch the attention_mask and the feature_attention_mask * add to PRIVATE_MODELS in check_repo.py; add to MODEL_NAMES_TO_IGNORE in check_table.py * fix initialization * update chat_template * fix consistency issue after copy * add docstrings to _merge_input_ids_with_audio_features * add copied from to prepare_inputs_for_generation * add more details to docs * rm 
comment * add init_std * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * Update src/transformers/models/qwen2_audio/modeling_qwen2_audio.py Co-authored-by: Yoach Lacombe <[email protected]> * update * Update docs/source/en/model_doc/qwen2_audio.md Co-authored-by: amyeroberts <[email protected]> * update tests * rm ignore_index * update processor * rm ffmpeg_read * Update tests/models/qwen2_audio/test_modeling_qwen2_audio.py Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/qwen2_audio.md Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/qwen2_audio.md Co-authored-by: amyeroberts <[email protected]> * Update docs/source/en/model_doc/qwen2_audio.md Co-authored-by: amyeroberts <[email protected]> * update * typo * [run_slow] qwen2_audio * [run_slow] qwen2_audio * [run_slow] qwen2_audio * fix quality * [run_slow] qwen2_audio * [run_slow] qwen2_audio * [run_slow] qwen2_audio * add official model --------- Co-authored-by: Yoach Lacombe <[email protected]> Co-authored-by: amyeroberts <[email protected]> * filter flash_attn optional imports loading remote code (#30954) * filter flash_attn optional imports loading remote code * improve pattern * fix code style * Update src/transformers/dynamic_module_utils.py Co-authored-by: Matt <[email protected]> --------- Co-authored-by: Matt <[email protected]> * 🌐 [i18n-KO] Translated `ko-llm_tutorial_optimization.md` to Korean (#32372) * docs: ko: llm_tutorial_optimization.md * feat: nmt draft * fix: manual edits * Update docs/source/ko/llm_tutorial_optimization.md Co-authored-by: Chaewon Song <[email protected]> * Update docs/source/ko/llm_tutorial_optimization.md 
Co-authored-by: Chaewon Song <[email protected]> * fix: resolve suggestions - 1 Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: timdalxx <[email protected]> Co-authored-by: boyunJang <[email protected]> * fix: resolve suggestions - 2 Co-authored-by: boyunJang <[email protected]> Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: timdalxx <[email protected]> --------- Co-authored-by: Chaewon Song <[email protected]> Co-authored-by: timdalxx <[email protected]> Co-authored-by: boyunJang <[email protected]> * 🌐 [i18n-KO] Translated `trainer.md` to Korean (#32260) * docs: ko: ko-trainer * feat: nmt draft * fix: manual edits * fix: manual edits * fix: glossary * fix: glossary * Apply suggestions from code review Co-authored-by: Jinuk <[email protected]> Co-authored-by: SeongWooChoi <[email protected]> --------- Co-authored-by: Jinuk <[email protected]> Co-authored-by: SeongWooChoi <[email protected]> * 🌐 [i18n-KO] Translated `eetq.md` to Korean (#32352) * docs: ko: quantization/eetq.md * feat: nmt draft * fix docs: ko: quantization/eetq.md * fix docs: ko: quantization/eetq.md * fix: resolve suggestions Co-authored-by: Jiwook Han <[email protected]> * fix: resolve suggestions * fix: resolve suggsetions --------- Co-authored-by: Jiwook Han <[email protected]> * 🌐 [i18n-KO] Translated `fsdp.md` to Korean (#32261) * docs: ko: fsdp.md * feat: nmt draft * fix: manual edits * Apply suggestions from code review Co-authored-by: 김준재 <[email protected]> Co-authored-by: Minki Kim <[email protected]> * fix: resolve suggestions * Update docs/source/ko/fsdp.md Co-authored-by: 김준재 <[email protected]> * Update docs/source/ko/fsdp.md Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: 김준재 <[email protected]> Co-authored-by: Minki Kim <[email protected]> Co-authored-by: Steven Liu <[email protected]> * 🌐 [i18n-KO] Translated `bitsandbytes.md` to Korean (#32408) * docs: ko: quantization/bitsandbytes.md * feat: nmt draft * fix: minor typos 
* fix: manual edits * fix: manual edits * fix: resolve suggestions Co-authored-by: wony617 <[email protected]> Co-authored-by: YONGSANG <[email protected]> Co-authored-by: Woojun Jung <[email protected]> * fix: resolve suggestions Co-authored-by: Steven Liu <[email protected]> * Apply suggestions from code review Co-authored-by: Steven Liu <[email protected]> * Apply suggestions from code review Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: wony617 <[email protected]> Co-authored-by: YONGSANG <[email protected]> Co-authored-by: Woojun Jung <[email protected]> Co-authored-by: Steven Liu <[email protected]> * Fix generate with `inputs_embeds` as input (#32493) * I think inputs_embeds has ndim == 3 * fix sequence length catch * add generate test * [run-slow]olmo, persimmon, gemma, gemma2, qwen2, llama * skip whisper * fix bart test * more fixes * Fixed test `test_static_cache_exportability` with torch 2.4.0 (#32516) Workaround the export issue in torch 2.4 Co-authored-by: Guang Yang <[email protected]> * Fix code example to load bigcode starcoder2 7b (#32474) * [docs] Translation guide (#32547) clarify * Gemma2: fix FA2 generation (#32553) fix FA2 * Fix a bug in Qwen2Audio (#32552) fix _update_model_kwargs_for_generation * fix slow integration gemma2 test (#32534) no empty revision * fix non contiguous tensor value error in save_pretrained (#32422) Signed-off-by: duzhanwei <[email protected]> Co-authored-by: duzhanwei <[email protected]> * 🌐 [i18n-KO] Translated `agent.md` to Korean (#32351) * docs: ko: main_classes/agent * feat: chatgpt draft * fix: manual edits * �fix: resolve suggestions Co-authored-by: Woojun Jung <[email protected]> Co-authored-by: thsamaji <[email protected]> Co-authored-by: SeungAhSon <[email protected]> * fix: resolve suggestions * fix: resolve code line number --------- Co-authored-by: Woojun Jung <[email protected]> Co-authored-by: thsamaji <[email protected]> Co-authored-by: SeungAhSon <[email protected]> * Add 
new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <[email protected]> * fix * fix tests --------- Co-authored-by: Arthur <[email protected]> * Fix: FA2 with packed training (#32487) * fix check * add tests * [run-slow] llama, gemma2 * oops, whisper actually runs but needed some special treatment * Fix sliding window attention used in Gemma2FlashAttention2 (#32522) * fix sliding window attention (flash2) in gemma2 model * [run-slow] gemma * fix slicing attention_mask for flash_attn2 * fix slicing attention_mask when flash_attn is used * add missing comment * slice the last seq_len tokens in the key, value states * revert code of slicing key, value states * fix: Fixed conditional check for `encodec` model names (#32581) * Fixed conditional check for encodec model names. * Reformatted conditional check. * Fix `.push_to_hub(..., create_pr=True, revision="my-branch")` when creating PR on not-owned repo (#32094) Fix create_pr aagainst existing revision * Bump aiohttp from 3.9.4 to 3.10.2 in /examples/research_projects/decision_transformer (#32569) Bump aiohttp in /examples/research_projects/decision_transformer Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.9.4 to 3.10.2. - [Release notes](https://github.com/aio-libs/aiohttp/releases) - [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst) - [Commits](https://github.com/aio-libs/aiohttp/compare/v3.9.4...v3.10.2) --- updated-dependencies: - dependency-name: aiohttp dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump torch from 1.13.1 to 2.2.0 in /examples/research_projects/visual_bert (#32220) Bump torch in /examples/research_projects/visual_bert Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0. - [Release notes](https://github.com/pytorch/pytorch/releases) - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md) - [Commits](https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0) --- updated-dependencies: - dependency-name: torch dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Cleanup tool calling documentation and rename doc (#32337) * Rename "Templates for Chat Models" doc to "Chat Templates" * Small formatting fix * Small formatting fix * Small formatting fix * Cleanup tool calling docs as well * Remove unneeded 'revision' * Move tip to below main code example * Little bonus section on template editing * 🌐 [i18n-KO] Translated `deepspeed.md` to Korean (#32431) * Update _toctree.yml * docs: ko: deepspeed.md * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> * Update docs/source/ko/deepspeed.md * Update docs/source/ko/deepspeed.md Co-authored-by: SeungAhSon <[email protected]> * Apply suggestions from code review Co-authored-by: wony617 <[email protected]> * Update docs/source/ko/_toctree.yml --------- Co-authored-by: wony617 <[email protected]> Co-authored-by: Steven Liu <[email protected]> Co-authored-by: SeungAhSon <[email protected]> * 🌐 [i18n-KO] Translated `awq.md`to Korean (#32324) * fix: manual edits * Apply suggestions from code review Co-authored-by: SeongWooChoi 
<[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> * fix:manual edits - 잘못된 경로에 번역본 파일을 생성해서 옮김 * Delete docs/source/ko/tasks/awq.md * Update docs/source/ko/_toctree.yml Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: SeongWooChoi <[email protected]> Co-authored-by: Chulhwa (Evan) Han <[email protected]> Co-authored-by: Steven Liu <[email protected]> * fix: Fixed failing `test_find_base_model_checkpoint` (#32638) Fixed failing test_find_base_model_checkpoint. * Bump tensorflow from 2.11.1 to 2.12.1 in /examples/research_projects/decision_transformer (#32341) Bump tensorflow in /examples/research_projects/decision_transformer Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.11.1 to 2.12.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](https://github.com/tensorflow/tensorflow/compare/v2.11.1...v2.12.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix: Updated the `is_torch_mps_available()` function to include `min_version` argument (#32545) * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * sorted the import. * Fixed wrong argument in is_torch_mps_available() function call. * Fixed wrong argument in is_torch_mps_available() function call. * Update src/transformers/utils/import_utils.py Co-authored-by: Arthur <[email protected]> * removed extra space. * Added type hint for the min_version parameter. * Added missing import. 
--------- Co-authored-by: Arthur <[email protected]> * Expand inputs in processors for VLMs (#30962) * let it be * draft * should not have changed * add warnings * fix & add tests * fix tests * ipnuts embeds cannot be passed with pixels * more updates * paligemma ready! * minor typos * update blip-2 * fix tests & raise error * docstring * add blip2 test * tmp * add image seq length to config * update docstring * delete * fix tests * fix blip * fix paligemma * out-of-place scatter * add llava-next-video * Update src/transformers/models/blip_2/modeling_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * remove tmp * codestyle * nits * more nits * remove overriding in tests * comprehension when merging video * fix-copies * revert changes for embeds test * fix tests after making comprehension * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * Update src/transformers/models/blip_2/processing_blip_2.py Co-authored-by: Pablo Montalvo <[email protected]> * more updates * fix tests --------- Co-authored-by: Pablo Montalvo <[email protected]> * Automatically add `transformers` tag to the modelcard (#32623) * Automatically add `transformers` tag to the modelcard * Specify library_name and test * Fix tests (#32649) * skip failing tests * [no-filter] * [no-filter] * fix wording catch in FA2 test * [no-filter] * trigger normal CI without filtering * fix tensors on different devices in `WhisperGenerationMixin` (#32316) * fix * enable on xpu * no manual remove * move to device * remove to * add move to * Add support for GrokAdamW optimizer (#32521) * add grokadamw * reformat * code review feedback, unit test * reformat * reformat * Add Depth Anything V2 Metric models (#32126) * add checkpoint and repo names * adapt head to support metric depth estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * add checkpoint and repo names * adapt head to support metric depth 
estimation * add max_depth output scaling * add expected logits * improve docs * fix docstring * rename depth_estimation to depth_estimation_type * add integration test * Refactored tests to include metric depth model inference test * Integration test pass when the timm backbone lines are commented (L220-L227) * address feedback * replace model path to use organization path * formatting * delete deprecated TODO * address feedback * [run_slow] depth_anything * Fix: Fixed directory path for utils folder in `test_tokenization_utils.py` (#32601) * Removed un-necessary expressions. * Fixed directory path for utils folder in test_tokenization_utils.py * Modify ProcessorTesterMixin for better generalization (#32637) * Add padding="max_length" to tokenizer kwargs and change crop_size to size for image_processor kwargs * remove crop_size argument in align processor tests to be coherent with base tests * Add pad_token when loading tokenizer if needed, change test override tokenizer kwargs, remove unnecessary test overwrites in grounding dino * TF_Deberta supporting mixed precision (#32618) * Update modeling_tf_deberta.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py Corrected some codes which do not support mixed precision * Update modeling_tf_deberta_v2.py * Update modeling_tf_deberta.py * Add files via upload * Add files via upload * Fix tests recurrent (#32651) * add fix for recurrentgemma * [no-filter] * trigger-ci * [no-filter] * [no-filter] * attempt to fix mysterious zip error * [no-filter] * fix lookup error * [no-filter] * remove summarization hack * [no-filter] * Support MUSA (Moore Threads GPU) backend in transformers (#31913) Add accelerate version check, needs accelerate>=0.33.0 * fix: Fixed failing tests in `tests/utils/test_add_new_model_like.py` (#32678) * Fixed failing tests in tests/utils/test_add_new_model_like.py * Fixed formatting using ruff. * Small nit. 
* Update translation docs review (#32662) update list of people to tag * Add TorchAOHfQuantizer (#32306) * Add TorchAOHfQuantizer Summary: Enable loading torchao quantized model in huggingface. Test Plan: local test Reviewers: Subscribers: Tasks: Tags: * Fix a few issues * style * Added tests and addressed some comments about dtype conversion * fix torch_dtype warning message * fix tests * style * TorchAOConfig -> TorchAoConfig * enable offload + fix memory with multi-gpu * update torchao version requirement to 0.4.0 * better comments * add torch.compile to torchao README, add perf number link --------- Co-authored-by: Marc Sun <[email protected]> * Fix `JetMoeIntegrationTest` (#32332) JetMoeIntegrationTest Co-authored-by: ydshieh <[email protected]> * Update the distributed CPU training on Kubernetes documentation (#32669) * Update the Kubernetes CPU training example * Add namespace arg Signed-off-by: Dina Suehiro Jones <[email protected]> --------- Signed-off-by: Dina Suehiro Jones <[email protected]> * fix: Fixed unknown pytest config option `doctest_glob` (#32475) Fixed unknown config option doctest_glob. * Unpin deepspeed in Docker image/tests (#32572) Unpin deepspeed * Updated workflows to the latest versions (#32405) Updated few workflows to the latest versions. * reopen: llava-next fails to consider padding_side during Training (#32679) restore #32386 * fix: Corrected ` falcon-mamba-7b` model checkpoint name (#32837) Corrected the model checkpoint. 
* fix: update doc link for runhouse in README.md (#32664) * VLMs: small clean-up for cache class (#32417) * fix beam search in video llava * [run-slow] video_llava * add back the position ids (#32554) * add back the position ids * fix failing test * Use head_dim if in config for RoPE (#32495) * use head_dim if in config for RoPE * typo * simplify with getattr * Generate: unify `LogitsWarper` and `LogitsProcessor` (#32626) * [tests] make test_sdpa_equivalence device-agnostic (#32520) * fix on xpu * [run_all] * Cache: use `batch_size` instead of `max_batch_size` (#32657) * more precise name * better docstrings * Update src/transformers/cache_utils.py Co-authored-by: Arthur <[email protected]> --------- Co-authored-by: Arthur <[email protected]> * Fix AutoConfig and AutoModel support for Llava-Next-Video (#32844) * Fix: fix all model_type of Llava-Next-Video to llava_next_video * Fix doc for llava_next_video * * Fix formatting issues * Change llava-next-video.md file name into llava_next_video.md to make it compatible with implementation * Fix docs TOC for llava-next-video * improve _get_is_as_tensor_fns (#32596) * improve _get_is_as_tensor_fns * format * Revert PR 32299, flag users when Zero-3 was missed (#32851) Revert PR 32299 * fix multi-gpu with static cache (#32543) * Reduce the error log when using core models that need their weights renamed, and provide a step forward (#32656) * Fin * Modify msg * Finish up nits * Make beam_constraints.Constraint.advance() docstring more accurate (#32674) * Fix beam_constraints.Constraint.advance() docstring * Update src/transformers/generation/beam_constraints.py Co-authored-by: Steven Liu <[email protected]> --------- Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Steven Liu <[email protected]> * generate: missing `to` in DoLa body, causing exceptions in multi-gpu generation (#32856) * Add Flax Dinov2 (#31960) * tfmsenv restored in main * installed flax * forward pass done and all tests passed * make 
fix-copies and cleaning the scripts * fixup attempt 1 * fixup attempt 2 * fixup third attempt * fixup attempt 4 * fixup attempt 5 * dinov2 doc fixed * FlaxDinov2Model + ForImageClassification added to OBJECTS_TO_IGNORE * external pos_encoding layer removed * fixup attempt 6 * fixed integration test values * fixup attempt 7 * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/models/dinov2/modeling_flax_dinov2.py Co-authored-by: amyeroberts <[email protected]> * Update src/tran…

gante mentioned this pull request May 13, 2024

tracker: generate compatibility with torch.compile #28981 (Open, 32 tasks)

gante force-pushed the end_to_end_mvp branch from 0320b91 to ca05970 May 14, 2024 09:48

ydshieh changed the title from "Generate: end-to-end compilation" to "[WIP] Generate: end-to-end compilation" May 14, 2024

gante force-pushed the end_to_end_mvp branch 3 times, most recently from 7db06d8 to 9c08cec May 22, 2024 14:38

gante requested review from ArthurZucker and zucchini-nlp May 25, 2024 15:35

gante changed the title from "[WIP] Generate: end-to-end compilation" to "Generate: end-to-end compilation" May 25, 2024

gante commented May 25, 2024

src/transformers/models/cohere/modeling_cohere.py (outdated)

gante mentioned this pull request May 25, 2024

Static cache + torch.compile: better documentation for prefill static sequence length #29151 (Closed)

zucchini-nlp reviewed May 27, 2024

gante force-pushed the end_to_end_mvp branch from 3bea9b6 to 1e087cb May 29, 2024 12:34

gante mentioned this pull request May 29, 2024

DO NOT MERGE: generate compatible with torch.compile(fullgraph=True) #29374 (Closed)

ArthurZucker approved these changes Jun 12, 2024

huggingface deleted a comment from github-actions bot Jul 8, 2024

gante force-pushed the end_to_end_mvp branch from 1e087cb to 64b2f29 July 8, 2024 13:36

gante force-pushed the end_to_end_mvp branch from 0f81de3 to e8cbf32 July 9, 2024 13:08

ArthurZucker reviewed Jul 9, 2024

src/transformers/models/cohere/modeling_cohere.py (outdated)

src/transformers/models/llama/modeling_llama.py (outdated)

gante added 19 commits July 27, 2024 13:53

0ebc4c7 merged with main
86c7170 nits
b2b6001 add todo
2fcc207 final corrections
e84aedb add docs for generation compilation
d2b45a4 docs nits
64ce18b add tip
ef4d419 PR suggestions
d5e920d add more details to the compilation docs
40482d3 fix cache positions
e3d9c04 cache is now init in generate; update docs
139e212 tag test as flaky
bc4ad7d docs
54c9eef post rebase make fixup and other nits
3186b14 remove unintended changes
484d922 whisper (encoder-decoder) not supported
bf9ef8a move token default updates to ; add tests for token defaults
f2e2833 push changes
16f92f4 manual rebase

gante force-pushed the end_to_end_mvp branch from 9c805f0 to 16f92f4 on July 27, 2024 14:03

gante added 3 commits July 27, 2024 14:25

chameleon doesn't support this

838ba6a

fix test_static_cache_mha_mqa_gqa (broken in another PR)

795d058

docs: dynamic is better with end-to-end compilation

d2e423b

gante merged commit 7ffe25f into huggingface:main Jul 29, 2024
24 checks passed

gante deleted the end_to_end_mvp branch July 29, 2024 09:52

yiming0416 mentioned this pull request Oct 1, 2024

hf_T5_generate torch.complie() failed with latest transformers==4.44.2 pytorch/pytorch#137133

Closed

