From f995fdb44d9ca0d5d487c0c6a48a9a24caa7c517 Mon Sep 17 00:00:00 2001 From: Pavel Iakubovskii Date: Tue, 6 Aug 2024 07:59:13 +0000 Subject: [PATCH] Squashed commit of the following: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit 37c5ca5eb9012a1009cf23b892828902f6a8799a Author: Raushan Turganbay Date: Tue Aug 6 10:24:19 2024 +0500 Cache: create docs (#32150) * draft * updates * works? * try adding python example in hidden section * another try * hwo do i render python * format as html code? * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante * Update docs/source/en/kv_cache.md Co-authored-by: Joao Gante * one more small update * should render hidden secrtion now * add outputs * fix links * check links * update all links * update with offloaded cache * all cache is importable, so they appear in docs * fix copies * docstring... --------- Co-authored-by: Joao Gante commit 13dc6b0853c3cb54e79b18105c0528bc9e84881c Author: Francisco Kurucz Date: Mon Aug 5 19:14:50 2024 -0300 Fix documentation links and code reference to model llava-next (#32434) commit 7e5d46ded433605a906fcab6be43ac85307cca9b Author: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Date: Mon Aug 5 16:33:19 2024 +0100 Respect the config's attn_implementation if set (#32383) * Respect the config's attn if set * Update test - can override in from_config * Fix commit 458b0cd2c544cdd6c700f9b0c21077c889bcee6c Author: Sai-Suraj-27 Date: Mon Aug 5 19:49:42 2024 +0530 fix: Updated `test_embeded_special_tokens` for luke and mluke models (#32413) Fixed tokenizertests for luke, mluke models. commit baf7e5c927744122c89ab1270c6c312541c7eb41 Author: Abdi <48970896+AbdiHaryadi@users.noreply.github.com> Date: Mon Aug 5 21:15:36 2024 +0800 Persist embedding type of BART and mBART models after resize (#32242) * fix: persist embedding type of MBartConditonalGeneration after resize * fix: persist embedding type of BartConditonalGeneration after resize commit f5f1e52f6cf13cdf63ff25c311d33e2f2a842911 Author: Francisco Kurucz Date: Mon Aug 5 05:18:28 2024 -0300 Fix documentation references to google/bit-50 model (#32407) commit ea5da52ebc062ff56f0e3aa05b0e3cc981731e14 Author: Nicholas Broad Date: Mon Aug 5 00:51:58 2024 -0700 add values for neftune (#32399) I always forget what typical values are, and I have to look at the paper everytime. This will be a helpful reminder. 
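As a quick reference for the "add values for neftune (#32399)" note above, a minimal sketch of where the value goes; the output directory is illustrative and 5/10/15 is the range typically reported for NEFTune:

```python
from transformers import TrainingArguments

# Minimal sketch for the NEFTune note above; assumes the paper's typical
# noise scales of 5, 10, or 15.
args = TrainingArguments(
    output_dir="out",          # illustrative
    neftune_noise_alpha=5.0,   # typical values: 5, 10, 15
)
```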
commit 3d7c2f9dea45338b7ebcd459b452e2fad7abfa1f Author: Ita Zaporozhets <31893021+itazap@users.noreply.github.com> Date: Mon Aug 5 09:22:48 2024 +0200 * save total_vocab_size = vocab_size + user added tokens to speed up operation * updating length when added_tokens_decoder is set * add test len(tokenizer) commit 3bb646a54f42030e9bafa47cd3f64367691a3bc5 Author: Raushan Turganbay Date: Mon Aug 5 11:58:42 2024 +0500 Phi3 tests: fix typing for Python 3.8 (#32388) fix phi commit 05ae3a300d6f3534eeb99a08828a5bae6dd973db Author: TechInterMezzo Date: Mon Aug 5 08:40:58 2024 +0200 fix: SeamlessM4TFeatureExtractor stride remainder (#32088) * fix: SeamlessM4TFeatureExtractor stride remainder * Added attention mask size test * Reran ruff for style correction commit 847bb856d55e3664150e408448fa59d0705b4d60 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 5 08:38:34 2024 +0200 Bump keras from 2.8.0 to 2.13.1 in /examples/research_projects/decision_transformer (#32393) Bump keras in /examples/research_projects/decision_transformer Bumps [keras](https://github.com/keras-team/keras) from 2.8.0 to 2.13.1. - [Release notes](https://github.com/keras-team/keras/releases) - [Commits](https://github.com/keras-team/keras/compare/v2.8.0...v2.13.1) --- updated-dependencies: - dependency-name: keras dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 621fb3c0edddf98f3272f3b197e772af4fa30b6c Author: Xueshen Liu Date: Sat Aug 3 14:07:55 2024 -0400 MixtralFlashAttention2: put "plus 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. (#31500) * Mixtral: remove unnecessary plus 1 when calculating rotary_seq_len, allowing position_ids=None (no auto position_ids generation could be unsafe) * fix typo [:-1] to [:, -1] * to meet formatting requirement * to meet formatting requirement * remove white space * MixtralFlashAttention2: put "+ 1" inside parentheses when calculating rotary_seq_len, allowing None position_ids input. Fix format/style issue. * propagate to startcoder2, phi3, mixtral and qwen2 * update qwen2_moe commit 7c31d05b59a9dce24b8ddc4b2bb8c8cf6bb5fd77 Author: Shaopeng Fu Date: Sat Aug 3 19:24:11 2024 +0300 fix: (issue #32124) Exception raised when running `transformers/examples/flax/language-modeling/t5_tokenizer_model.py`. (#32157) fix: Exception raised when running . commit c1aa0edb48217f416f4bbe6e3a9db1500284513b Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Fri Aug 2 17:32:50 2024 +0800 [generate] only require an attention mask for mps with torch<2.4 (#32367) * up * style * stopping commit 083e13b7c47f674b11c74d1b7c7ee7cd1241b406 Author: Joao Gante Date: Fri Aug 2 09:39:45 2024 +0100 RoPE: Add numerical tests ✨ (#32380) tests! 
:D commit 2af199c42b545f6248475ce456dd6c2a351b8522 Author: Raushan Turganbay Date: Fri Aug 2 09:54:16 2024 +0500 Update docs (#32368) nits commit 82efc53513a51660e629c7eca8210af1d67df00b Author: Zach Mueller Date: Thu Aug 1 15:18:43 2024 -0400 Yell at the user if zero-3 init wasn't performed, but expected to have been done (#32299) * Test this zach * Test for improper init w/o zero3 * Move back * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Get rid of stars in warning * Make private * Make clear --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 51ab25e2932da15511ced35bcbdfa92d25c4794c Author: OsamaS99 <62110783+OsamaS99@users.noreply.github.com> Date: Thu Aug 1 14:57:42 2024 +0200 Fixed Hybrid Cache Shape Initialization. (#32163) * fixed hybrid cache init, added test * Fix Test Typo --------- Co-authored-by: Aaron Haag commit e3d8285a84f803e962050e2c2283f3362e36bfbc Author: Joao Gante Date: Thu Aug 1 13:46:11 2024 +0100 Docker: add `speech` dep to the consistency docker image (#32374) commit ca59d6f77c9fda197222f9aa9205d8c7b5dff34e Author: Nikos Karampatziakis Date: Thu Aug 1 05:42:07 2024 -0700 Offloaded KV Cache (#31325) * Initial implementation of OffloadedCache * enable usage via cache_implementation * Address feedback, add tests, remove legacy methods. * Remove flash-attn, discover synchronization bugs, fix bugs * Prevent usage in CPU only mode * Add a section about offloaded KV cache to the docs * Fix typos in docs * Clarifications and better explanation of streams commit b4727a1216bb21df2795e973063ed07202235d7e Author: Omar Salman Date: Thu Aug 1 17:32:13 2024 +0500 Fix conflicting key in init kwargs in PreTrainedTokenizerBase (#31233) * Fix conflicting key in init kwargs in PreTrainedTokenizerBase * Update code to check for callable key in save_pretrained * Apply PR suggestions * Invoke CI * Updates based on PR suggestion commit db8c7caeb6b3969a2153b36ba3e5fdef6534c1d6 Author: Viktor Scherbakov Date: Thu Aug 1 14:30:10 2024 +0200 Empty list in defaults for LLaMA special tokens during weights conversion (#32342) empty list in defaults commit 2229ebe7220fb54bc5f91f575c2d7a988e7122cb Author: Ita Zaporozhets <31893021+itazap@users.noreply.github.com> Date: Thu Aug 1 13:57:41 2024 +0200 update clean_up_tokenization_spaces warning (#32371) commit 05c1f9af9a5ebd213dd923e97f6fbed4c115f3c6 Author: Hanna Yukhymenko <49597980+ayukh@users.noreply.github.com> Date: Thu Aug 1 13:52:05 2024 +0200 Check device map for saving tokenizer config on TPU (fix for issue #31971) (#32043) * Remove TPU device map for saving tokenizer config * Update tokenization_utils_base.py * Fix error msg when passing non-string device into tokenizer * Fix error message for non-string tokenizer device * Print out tokenizer device type in error msg * Update tokenization_utils_base.py commit 9e2828403218da16d9759c9be020b70f51df373d Author: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Date: Thu Aug 1 19:51:20 2024 +0800 add missing attribute _supports_param_buffer_assignment for gpt-j. 
(#32359) Co-authored-by: Guoming Zhang <37257613+nv-guomingz@users.noreply.github.com> commit 48ed24c50ab29bf690f2ab030721e6a8b0aa5205 Author: Lunwen He Date: Thu Aug 1 04:49:00 2024 -0700 Remove size check between attn_weights and kv_seq_len for phi3 (#32339) * Remove size check between attn_weights and kv_seq_len * add unit tests commit e234061cddd28bb8b82144833241883816289e40 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Thu Aug 1 18:10:56 2024 +0800 [whisper] compile compatibility with long-form decoding (#31772) * [whisper] compile compatibility with long-form decoding * clarify comment * fix after rebase * finalise * fix bsz * fix cache split * remove contiguous * style * finish * update doc * prevent cuda graph trace commit 9451a385261b30e7319a2c93285ab76161e8c003 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Thu Aug 1 16:05:27 2024 +0800 [enc-dec cache] fix bug in indexing (#32370) commit 453e74884fb7e2613e7b45033fbb3c1cadb638b4 Author: Raushan Turganbay Date: Thu Aug 1 09:48:03 2024 +0500 LLaVa: add cache class attribute (#32278) cache class flag commit 14ee2326e51cb210cec72f31b248cb722e9d5d1f Author: Ricardo Date: Thu Aug 1 06:34:22 2024 +0800 fix: warmup_steps check for training_args (#32236) commit 53f0c9c2906e0b0f1623bfdfb420fca1e655098d Author: Sai-Suraj-27 Date: Thu Aug 1 01:26:50 2024 +0530 fix: Removed unnecessary `@staticmethod` decorator (#32361) * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. * Fixed staticmethods with self as first argument. commit 92abe6033491dcaa958235e551f40f6b417d3771 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Wed Jul 31 20:03:07 2024 +0200 >3-5x faster torch.compile forward compilation for autoregressive decoder models (#32227) * draft * apply changes to all relevant archs * rerun ci - check_docstrings.py failing? 
* fix docstring * move 2D->4D mask creation to modeling file * repo consistency * fix the batch size = 1 case - calling contiguous is not enough * nit * style * propagate to gemma/gemma-2 * prepare inputs for gemma generation * implement test and tiny fix in gemma2 * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix copies * ci pass * fix gemma's test_compile_static_cache tests * flacky * retrigger ci --------- Co-authored-by: sanchit-gandhi Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> commit b46bd8b9d2ac991c0c04674957ebc0a65fb3f42b Author: Aymeric Roucher <69208727+aymeric-roucher@users.noreply.github.com> Date: Wed Jul 31 18:44:53 2024 +0200 Fix error when streaming to gradio with non-string tool arguments (#32360) Fix error when streaming agent run to gradio with non-string tool arguments commit ef177a5e1cdf0ca53e24e6d76e813198f7300dc4 Author: Joao Gante Date: Wed Jul 31 16:04:48 2024 +0100 Gemma 2: support assisted generation (#32357) commit 5f1fcc299cb00c1edce5eb1efb8bacdde2365690 Author: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Date: Wed Jul 31 14:51:04 2024 +0100 [Idefics2] - Fix FA2 call for Perceiver layer (#32275) * Fix FA2 call for Perciever layer * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 * Fix up * [run_slow] idefics2 * [run_slow] idefics2 * [run_slow] idefics2 commit b75ad56620431984a44a962c98136c8571b4fca9 Author: Joao Gante Date: Wed Jul 31 11:12:46 2024 +0100 Llama 3.1: Fix incorrect `inv_freq` assignment (#32330) fix 💩 commit 7f552e28e0aca00ce60868c7620f7463eab60e14 Author: Raushan Turganbay Date: Wed Jul 31 10:33:38 2024 +0500 Gemma2 and flash-attention (#32188) * enable flash-attn & static cache * this works, not the prev * fix for sliding window layers * not needed anymore commit a3264332cfb5ab8675ddb42740a75aeee1782a74 Author: Raushan Turganbay Date: Wed Jul 31 10:01:12 2024 +0500 LLaVA-NeXT: fix anyres shapes (#32314) fix commit 6e2d04e429dc4ce240c99bd14b7b84550b79fd73 Author: Joshua Lochner Date: Tue Jul 30 23:36:38 2024 +0200 Fix slow GemmaTokenizer and improve SPM slow -> fast conversion process (#32191) * Remove user-defined tokens which can be obtained through merges * Remove debug line * formatting * Refactor spm slow -> fast converter * revert unnecessary refactor * set comprehension * remove test files * Use `vocab_scores` * Always replace spiece underline with space in decode * we no longer need token filtering * Add save fast load slow unit test * Remove tokenizers version check * Remove duplicate code * Make `` and `` special tokens * Bias merge priority with length if score is the same * Add unit test for merge priority * CI commit 026a173a64372e9602a16523b8fae9de4b0ff428 Author: Joao Gante Date: Tue Jul 30 18:56:10 2024 +0100 Repo checks: skip docstring checks if not in the diff (#32328) * tmp * skip files not in the diff * use git.Repo instead of an external subprocess * add tiny change to confirm that the diff is working on pushed changes * add make quality task * more profesh main commit reference commit 516af4bb63538edc448f814e3690dd5171c4f311 Author: fkrasnov2 Date: Tue Jul 30 20:21:45 2024 +0300 fixes #32329 : The Torch code is correct - to get an average of 10% o… (#32335) fixes #32329 : The Torch code is correct - to get an average of 10% of the total, we want to take 50% of the remainder after we've already masked 80% with [MASK] in the previous step. 
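To spell out the arithmetic in the "fixes #32329" entry above, a small sketch of the usual 80/10/10 MLM split (mirroring the DataCollatorForLanguageModeling-style pattern; the 15% selection rate and tensor size are illustrative):

```python
import torch

# Of the tokens selected for MLM:
#   * 80% become [MASK]
#   * 50% of the remaining 20% (i.e. 10% of the total) become a random token
#   * the final 10% are left unchanged
masked_indices = torch.rand(1000) < 0.15                             # tokens selected for MLM

replaced = (torch.rand(1000) < 0.8) & masked_indices                 # -> [MASK]
random_tok = (torch.rand(1000) < 0.5) & masked_indices & ~replaced   # 0.5 * 0.2 = 0.1 of the total
unchanged = masked_indices & ~replaced & ~random_tok                 # the remaining ~10%

n = masked_indices.sum().float()
print(replaced.sum() / n, random_tok.sum() / n, unchanged.sum() / n)  # ~0.8, ~0.1, ~0.1
```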
commit 62c60a30181a65e1a3a7f19c3055a240a6a21335 Author: Wing Lian Date: Tue Jul 30 12:55:59 2024 -0400 fixes to properly shard FSDP across cpu and meta for cpu_efficient_loading for prequantized 4bit (#32276) commit 16271080333ad52be5349fb31d789fb232b68760 Author: Sai-Suraj-27 Date: Tue Jul 30 22:23:03 2024 +0530 fix: Added missing raise keyword for few exceptions (#32333) Fixed raising of few exceptions. commit bd54ed2ed7f578e4122f3e6d536fbe3c9bc76de1 Author: plaggy <35706832+plaggy@users.noreply.github.com> Date: Tue Jul 30 18:48:18 2024 +0200 Alternative agent plan (#32295) * new agent plan * plan type assertion * style corrections * better prompt naming * make fixup commit e68ec18ce224af879f22d904c7505a765fb77de3 Author: Joao Gante Date: Tue Jul 30 15:49:14 2024 +0100 Docs: formatting nits (#32247) * doc formatting nits * ignore non-autodocs * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/esm/modeling_esm.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * make fixup --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 2fbbcf5007509c66b02924ce6dcff66f58e7f58c Author: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> Date: Tue Jul 30 16:00:13 2024 +0200 Fix M4T for ASR pipeline (#32296) * tentative fix * do the same for M4T commit 084b5094eb490319719cc11cb05b751e0b419d49 Author: Luc Georges Date: Tue Jul 30 14:49:26 2024 +0200 feat(ci): set `fetch-depth: 0` in trufflehog checkout step (#31663) commit 20528f067cf9204cea5178ce0f837245e146e159 Author: Teddy Ferdinan <64476430+teddy-f-47@users.noreply.github.com> Date: Tue Jul 30 11:25:54 2024 +0200 Cast epochs_trained to int when resuming training (#32286) * fix epochs_trained as int when resuming training * refactor --------- Co-authored-by: teddyferdinan commit 934fe1504e6d5e87e01d96305f4d97faa63cf4c1 Author: Isotr0py <2037008807@qq.com> Date: Tue Jul 30 17:01:00 2024 +0800 Fix GGUF dequantize for `gguf==0.9.1` (#32298) * fix gguf dequantize for gguf==0.9.1 * fix old version * make style commit 3e8106d2533cbd890ddd1e919bd62132cd4718c3 Author: Gilad Turok <36947659+gil2rok@users.noreply.github.com> Date: Tue Jul 30 03:19:24 2024 -0400 Docs: fix GaLore optimizer code example (#32249) Docs: fix GaLore optimizer example Fix incorrect usage of GaLore optimizer in Transformers trainer code example. The GaLore optimizer uses low-rank gradient updates to reduce memory usage. GaLore is quite popular and is implemented by the authors in [https://github.com/jiaweizzhao/GaLore](https://github.com/jiaweizzhao/GaLore). A few months ago GaLore was added to the HuggingFace Transformers library in https://github.com/huggingface/transformers/pull/29588. Documentation of the Trainer module includes a few code examples of how to use GaLore. However, the `optim_targe_modules` argument to the `TrainingArguments` function is incorrect, as discussed in https://github.com/huggingface/transformers/pull/29588#issuecomment-2006289512. This pull request fixes this issue. 
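For the GaLore docs fix above, a hedged sketch of what a corrected Trainer setup looks like; the target-module patterns and output directory are illustrative, and the `galore_torch` package is assumed to be installed:

```python
from transformers import TrainingArguments

# Hedged sketch of GaLore through the Trainer (needs the galore_torch package).
args = TrainingArguments(
    output_dir="out",
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],  # layer-name patterns GaLore is applied to
)
```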
commit f0bc49e7f61f74f055c47ad40e6010f57eed0b0b Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Mon Jul 29 22:12:21 2024 +0200 use torch 2.4 in 2 CI jobs (#32302) Co-authored-by: ydshieh commit a24a9a66f446dcb9277e31d16255536c5ce27aa6 Author: Aymeric Roucher <69208727+aymeric-roucher@users.noreply.github.com> Date: Mon Jul 29 20:12:44 2024 +0200 Add stream messages from agent run for gradio chatbot (#32142) * Add stream_to_gradio method for running agent in gradio demo commit 811a9caa2141bc98f96b36c69abcf1f934bd1fd2 Author: Guang Yang <42389959+guangy10@users.noreply.github.com> Date: Mon Jul 29 10:19:15 2024 -0700 Make static cache compatible with torch.export (#32168) commit 7f5d644e69068825bb5b6e84cdc56b3d3a9bd04f Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Mon Jul 29 21:24:42 2024 +0800 [pipeline] fix padding for 1-d tensors (#31776) * [pipeline] fix padding for 1-d tensors * add test * make style * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py Co-authored-by: Kamil Akesbi <45195979+kamilakesbi@users.noreply.github.com> * Update tests/pipelines/test_pipelines_automatic_speech_recognition.py --------- Co-authored-by: Kamil Akesbi <45195979+kamilakesbi@users.noreply.github.com> commit 3fbaaaa64d1ef3d8327adb577994d3d11277c77a Author: Kamil Akesbi <45195979+kamilakesbi@users.noreply.github.com> Date: Mon Jul 29 11:19:52 2024 +0100 Whisper tokenizer word level timestamps (#32197) * fix _fix_key in PreTrainedModel * fix _find_longest_common_sequence * add test * remove result.json * nit * update test commit 7ffe25f2b935dcaf65079b04c5f91c8a42a99e28 Author: Joao Gante Date: Mon Jul 29 10:52:13 2024 +0100 Generate: end-to-end compilation (#30788) * mvp * added test (a few models need fixes) * fix a few test cases * test nits * harder test 😈 * revert changes in stablelm * test with improved condition * add todo * tmp commit * merged with main * nits * add todo * final corrections * add docs for generation compilation * docs nits * add tip * PR suggestions * add more details to the compilation docs * fix cache positions * cache is now init in generate; update docs * tag test as flaky * docs * post rebase make fixup and other nits * remove unintended changes * whisper (encoder-decoder) not supported * move token default updates to ; add tests for token defaults * push changes * manual rebase * chameleon doesn't support this * fix test_static_cache_mha_mqa_gqa (broken in another PR) * docs: dynamic is better with end-to-end compilation commit 49928892d6491ff5a49c12cbc34695f6fa7ac0ed Author: Sai-Suraj-27 Date: Mon Jul 29 15:20:43 2024 +0530 fix(docs): Fixed a link in docs (#32274) Fixed a link in docs. commit 6494479f1de9fe16e9c6f89e52eb0cf81f864a7c Author: Fanli Lin Date: Mon Jul 29 17:29:11 2024 +0800 make `p_mask` a numpy array before passing to `select_starts_ends` (#32076) * fix * bug fix * refine * fix commit 535fe78b9f1d148684723e51f00645351880c47a Author: Joao Gante Date: Mon Jul 29 10:06:05 2024 +0100 Repo: remove exceptions in `check_docstrings` (#32259) remove exceptions commit a2ad9d5ad53f68c1ad268f7f46538eac6f5b631b Author: Sai-Suraj-27 Date: Mon Jul 29 14:13:09 2024 +0530 fix: Fixed wrong argument passed to `convert_blip_checkpoint` function call (#32262) Removed one wrong argument passed to convert_blip_checkpoint function call. 
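The cache/compilation entries above ("Make static cache compatible with torch.export", "Generate: end-to-end compilation") reduce to a pattern like the following hedged sketch; the checkpoint name, prompt, and compile flags are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch of compiled generation with a static KV cache.
model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compile the forward pass; the static cache keeps tensor shapes fixed across steps.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("The theory of relativity states", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tok.decode(out[0], skip_special_tokens=True))
```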
commit 5019aabfacf7599b9a6b4e7a1adc1fb5c9017727 Author: leejet Date: Mon Jul 29 15:51:43 2024 +0800 Optimize t5 tokenize logic to avoid redundant calls (#32270) * Optimize t5 tokenize logic to avoid redundant calls * fix and overwrite copies commit f2122cc6eb8e50e4d1b45da54b43bba59a458b30 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Mon Jul 29 09:42:54 2024 +0200 Upload new model failure report to Hub (#32264) upload Co-authored-by: ydshieh commit f7396876849926afa87c9412d67c43618dad403d Author: Raushan Turganbay Date: Mon Jul 29 10:58:59 2024 +0500 🚨 Bloom support for cache class (#31445) * bloom dynamic cache * bloom follows standard cache format * no skips for bloom anymore * use cache position when possible * clean up * codestyle * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/bloom/modeling_bloom.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * pr comments * isinstance fix * address comments * make musicgen test happy * [run-slow] bloom --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 44f6fdd74f84744b159fa919474fd3108311a906 Author: Joao Gante Date: Sat Jul 27 10:19:46 2024 +0100 Llama 3.1: replace for loop by tensor ops at inv_freq initialization (#32244) * replace for loop by tensor ops * rm assert; readability commit 8da90687308a10b33c5553b8a506cc04aab31702 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Fri Jul 26 20:52:45 2024 +0200 More flexible trigger condition (#32251) update Co-authored-by: ydshieh commit 81233c069c166af033794134bd8888783ac49ebe Author: Raushan Turganbay Date: Fri Jul 26 14:45:55 2024 +0500 Flash-Attn: fix generation when no attention mask or no pading (#32241) * fix * fix prev test (half of failures) * [run-slow] llama, gemma2 * [run-slow] llama, gemma2 commit 27c7f971c0dcd3bb423ea221fe2bce751d313119 Author: Fanli Lin Date: Fri Jul 26 17:41:27 2024 +0800 [tests] fix `static` cache implementation is not compatible with `attn_implementation==flash_attention_2` (#32039) * add flash attention check * fix * fix commit 5f841c74b62754f186a8c06a684d491524b7bc03 Author: Connor Anderson Date: Fri Jul 26 05:05:46 2024 -0400 Add check for `target_sizes is None` in `post_process_image_guided_detection` for owlv2 (#31934) * Add check for target_sizes is None in post_process_image_guided_detection * Make sure Owlvit and Owlv2 in sync * Fix incorrect indentation; add check for correct size of target_sizes commit f9756d9edb23354e3df50f7eb3f6b3129a25e453 Author: Rohit Dwivedula <25080952+rohitdwivedula@users.noreply.github.com> Date: Fri Jul 26 04:05:38 2024 -0500 Adds: extra_repr for RMSNorm layers in most models (#32204) * adds: extra_repr() to RMSNorm layers in multiple models * adds: extra_repr for deprecated models as well * formatting as per style guide commit b8e5cd5396f7c0cc2d5e10be6696ea38742abf51 Author: Sai-Suraj-27 Date: Fri Jul 26 14:03:02 2024 +0530 Refactor: Removed un-necessary `object` base class (#32230) * Refactored to remove un-necessary object base class. * small fix. 
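For the `extra_repr` entries above, the point is simply that RMSNorm layers now report their shape and epsilon when a model is printed; a minimal sketch (not the exact library code):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        return self.weight * hidden_states * torch.rsqrt(variance + self.variance_epsilon)

    def extra_repr(self):
        # What print(model) shows for this layer.
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"

print(RMSNorm(64))  # -> RMSNorm((64,), eps=1e-06)
```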
commit 1c7ebf1d6eaf0ed0fd4101fd6eb7e64601429cfe Author: João Nadkarni <38245862+joaonadkarni@users.noreply.github.com> Date: Fri Jul 26 09:38:59 2024 +0200 don't log base model architecture in wandb if log model is false (#32143) * don't log base model architecture in wandb is log model is false * Update src/transformers/integrations/integration_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * convert log model setting into an enum * fix formatting --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit c46edfb8230bcc3152e8338742dc4822289acb3d Author: Raushan Turganbay Date: Fri Jul 26 10:52:06 2024 +0500 Resize embeds with DeepSpeed (#32214) * fix resize when deepspeed * deepsped uses new embeds * we needed this commit fad15fba78e4603cd20695757ad899a6687485f9 Author: Raushan Turganbay Date: Fri Jul 26 10:17:27 2024 +0500 Llava: generate without images (#32183) * llava w/o images * tests commit 4ab33c2d81866d4dd2f29df07f1a35491acbb39b Author: Raushan Turganbay Date: Fri Jul 26 10:16:06 2024 +0500 Generation: stop at `eos` for assisted decoding (#31301) * fix * move changes to prompt lookup * add test * set eos in assistant model * style * fix flakiness * changes for new `main` * Update tests/generation/test_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/generation/test_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add comment to explain --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 9d6c0641c4a3c2c5ecf4d49d7609edd5b745d9bc Author: Pavel Iakubovskii Date: Thu Jul 25 19:20:47 2024 +0100 Fix code snippet for Grounding DINO (#32229) Fix code snippet for grounding-dino commit 3a83ec48a63a8298c8193be48cf00785674bfb70 Author: jrhe <4038905+jrhe@users.noreply.github.com> Date: Thu Jul 25 17:16:13 2024 +0100 Allow a specific microphone to be used by the ffmpeg audio pipeline utility functions. 
Default to using the currently active microphone on Mac (#31846) * use currently active microphone on mac for ffmpeg_microphone * Allow ffmpeg_microphone device to be specified Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 6ed0bf1e8543a7d8e6640bbf9a655c5e1401f7de Author: Huazhong Ji Date: Fri Jul 26 00:01:06 2024 +0800 translate philosophy.md to chinese (#32177) * translate philosophy.md to chinese * add the missing link commit df6eee9201e4ba2b80cea021a18e95ada26ca2cc Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Thu Jul 25 16:12:23 2024 +0200 Follow up for #31973 (#32025) * fix * [test_all] trigger full CI --------- Co-authored-by: ydshieh commit de2318894e4f971ea2273c653a702dc93db2bd6a Author: Kashif Rasul Date: Thu Jul 25 15:12:23 2024 +0200 [warnings] fix E721 warnings (#32223) fix E721 warnings commit 9b9a54e61bf8749588178b37c23d77b90679fd10 Author: Kashif Rasul Date: Thu Jul 25 15:11:43 2024 +0200 [BigBird Pegasus] set _supports_param_buffer_assignment to False (#32222) set _supports_param_buffer_assignment to False commit 1ecedf1d9ee927bac5b5bae8cb1892d936a5b622 Author: Austin <31086824+avlewis@users.noreply.github.com> Date: Thu Jul 25 07:20:27 2024 -0500 Update question_answering.py (#32208) commit f53a5dec7b03eb195dc89c82ae761b033db1ceb6 Author: Huazhong Ji Date: Thu Jul 25 17:04:04 2024 +0800 remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 (#32210) remove unnecessary guard code related with pytorch versions 1.4.2 ~ 1.7.0 commit 5658e749adbaaf883caec003cecae8ce0a4261a6 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Thu Jul 25 16:58:02 2024 +0800 [whisper] fix short-form output type (#32178) * [whisper] fix short-form output type * add test * make style * update long-form tests * fixes * last fix * finalise test commit 85a1269e19af022e04bc2aad82572cd5a9e8cdd9 Author: Sai-Suraj-27 Date: Wed Jul 24 22:30:21 2024 +0530 fix: Replaced deprecated `unittest method` with the correct one (#32198) Replaced deprecated unittest method with the correct one. 
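For the "Generation: stop at `eos` for assisted decoding" entry above, a hedged sketch of assisted generation; the model names are illustrative and the assistant must share the main model's tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: the small assistant drafts tokens, the main model verifies them,
# and generation now also stops when the assistant emits eos.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
assistant = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```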
commit edd68f4ed8db241bd3e9dc6c4ed96d471f243c9a Author: Matt Date: Wed Jul 24 17:36:32 2024 +0100 :rotating_light: No more default chat templates (#31733) * No more default chat templates * Add the template to the GPT-SW3 tests since it's not available by default now * Fix GPT2 test * Fix Bloom test * Fix Bloom test * Remove default templates again commit 1c122a46dc3c4448901f8d2f3018d9d58b846ba5 Author: Penut Chen <94501378+PenutChen@users.noreply.github.com> Date: Wed Jul 24 23:59:59 2024 +0800 Support dequantizing GGUF FP16 format (#31783) * support gguf fp16 * support gguf bf16 with pytorch * add gguf f16 test * remove bf16 commit af0e4b7b37b2d7eefe7531cf5201a5d6bae85525 Author: Marc Sun <57196510+SunMarc@users.noreply.github.com> Date: Wed Jul 24 17:14:05 2024 +0200 Fix float8_e4m3fn in modeling_utils (#32193) * Fix float8_e4m3fn in modeling_utils * style * fix * comment commit 1392a6867f40a55dfabaf306745c67627598b1af Author: Raushan Turganbay Date: Wed Jul 24 19:26:20 2024 +0500 Fix resize embedding with Deepspeed (#32192) fix resize when deepspeed commit 8d2534c4d0ab94a97a72d2ce6bb9ccd201abadb3 Author: Arthur <48595927+ArthurZucker@users.noreply.github.com> Date: Wed Jul 24 16:06:39 2024 +0200 let's not warn when someone is running a forward (#32176) * let's not warn when someone is running a foward without cache + self.training * more models * fixup commit e0182f3bd7f4753c1e378e052ceea67898d97359 Author: Joao Gante Date: Wed Jul 24 15:00:48 2024 +0100 RoPE: relaxed rope validation (#32182) * relaxed rope check * lets also accept rope_type=None, defaulting to the original implementation * type and rope_type can coexist commit 165116bc145dcc186fa287e624b28a9ab3a79955 Author: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Date: Wed Jul 24 14:03:40 2024 +0100 Remove conversational pipeline tests (#32099) Remove conversation pipeline tests commit 5f4ee98a7ade33e1c54fdd6181d04ee7b426b392 Author: Dr. Artificial曾小健 <875100501@qq.com> Date: Wed Jul 24 18:54:41 2024 +0800 Update qwen2.md (#32108) * Update qwen2.md outdated description * Update qwen2.md amended * Update qwen2.md Update * Update qwen2.md fix wrong version code, now good to go commit 8678879f1dc2578cec18232146bf19de97aecaa1 Author: 조준래 Date: Wed Jul 24 19:38:49 2024 +0900 fix: default value reflects the runtime environment variables rather than the ones present at import time. (#32153) * fix: default value reflects the runtime environment variables rather than the ones present at import time. * Fix: Change `deterministic` to None by default; use env var if None commit 01be5b48790f113b7d71943b580c842e3e097988 Author: Rohit Dwivedula <25080952+rohitdwivedula@users.noreply.github.com> Date: Wed Jul 24 02:09:59 2024 -0500 adds: extra_repr() to MambaRMSNorm to include hidden size / size of weights in the layer (#32171) * adds: extra_repr() to MambaRMSNorm to include the hidden size of the layer * style fix with ruff: commit c85510f958e6955d88ea1bafb4f320074bfbd0c1 Author: Fanli Lin Date: Wed Jul 24 00:47:51 2024 +0800 [docs] change temperature to a positive value (#32077) fix commit bc2adb0112b6677b0dfb4105c74570a0f92183eb Author: Sai-Suraj-27 Date: Tue Jul 23 21:22:41 2024 +0530 fix: Fixed an if condition that is always evaluating to true (#32160) Fixed an if condition always evaluating to true. 
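For the "Support dequantizing GGUF FP16 format" entry above, a hedged sketch of GGUF loading (the `gguf` package is assumed to be installed; the repo and file names are illustrative, and an F16 file is handled the same way as the quantized ones):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # illustrative repo
gguf_file = "tinyllama-1.1b-chat-v1.0.fp16.gguf"      # illustrative file name

# The checkpoint is dequantized to torch tensors on load.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```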
commit 23f6a43f82fb2980f4b30cf3f95eb3a940384895 Author: Joao Gante Date: Tue Jul 23 16:48:16 2024 +0100 fix (#32162) commit d5a99dfcee6e94065cb7c83cc8ab6fc5daa0cc4e Author: Lysandre Date: Tue Jul 23 16:58:17 2024 +0200 Llama 3.1 conversion Co-authored-by: Arthur Zucker commit ff0d708fe627d6715f9a3e97d0a7947f70437447 Author: Lysandre Date: Tue Jul 23 17:12:47 2024 +0200 Dev version: v4.44.0.dev0 commit d2c687b3f1859b5c61258af14abba5312c0e6201 Author: Sai-Suraj-27 Date: Tue Jul 23 20:37:31 2024 +0530 Updated `ruff` to the latest version (#31926) * Updated ruff version and fixed the required code accorindg to the latest version. * Updated ruff version and fixed the required code accorindg to the latest version. * Added noqa directive to ignore 1 error shown by ruff commit 9cf4f2aa9a9cecbb22e813931ef3bb72fc773540 Author: RhuiDih <166782544+RhuiDih@users.noreply.github.com> Date: Tue Jul 23 21:56:41 2024 +0800 Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629) * add DataCollatorBatchFlattening * Update data_collator.py * change name * new FA2 flow if position_ids is provided * add comments * minor fix * minor fix data collator * add test cases for models * add test case for data collator * remove extra code * formating for ruff check and check_repo.py * ruff format ruff format tests src utils * custom_init_isort.py commit 7d92009af647167bae338e9d4af8bc0452c62fbf Author: Deep Gandhi <97520292+DeF0017@users.noreply.github.com> Date: Tue Jul 23 19:11:52 2024 +0530 Added additional kwarg for successful running of optuna hyperparameter search (#31924) Update integration_utils.py Added additional kwarg commit 63700628adb91600c84fe3bbbc4c667cd3e3aa71 Author: Alvaro Moran <6949769+tengomucho@users.noreply.github.com> Date: Tue Jul 23 14:18:19 2024 +0200 feat(cache): StaticCache uses index_copy_ to avoid useless copy (#31857) * feat(cache): StaticCache uses index_copy_ to avoid useless copy Using index_copy_ allows for explicit in-place change of the tensor. Some backends (XLA) will otherwise copy the tensor, making the code slower and using more memory. Proposed implementation will end up using less memory and on XLA will result in less compilation, but the change is also quite generic, making no change whatsoever on CUDA or CPU backend. * feat(cache): SlidingWindowCache uses index_copy_ to avoid useless copy Applying the same change done in StaticCache. * fix(cache): fallback of index_copy_ when not implemented * fix(cache): in index_copy_ ensure tensors are on same device * [run slow] llama * fix(cache): add move of cache_position to same device in SlidingWindowCache * Revert "[run slow] llama" This reverts commit 02608dd14253ccd464e31c108e0cd94364f0e8b9. commit a009fbdab32a4b068c24052a4dfe7a7bc0fc89f9 Author: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Date: Tue Jul 23 12:23:34 2024 +0100 Fix typing to be compatible with later py versions (#32155) commit 3263b3435473cbb5dc66925bc29c1d32b5b8d431 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Tue Jul 23 18:34:30 2024 +0800 Revert "Incorrect Whisper long-form decoding timestamps " (#32148) Revert "Incorrect Whisper long-form decoding timestamps (#32003)" This reverts commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8. 
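The StaticCache/`index_copy_` entry above describes the following update pattern; the shapes here are illustrative:

```python
import torch

# index_copy_ writes the new key/value states into the preallocated cache in
# place along the sequence dimension; indexed assignment is the fallback for
# backends that do not implement index_copy_.
k_cache = torch.zeros(1, 8, 128, 64)        # (batch, heads, max_cache_len, head_dim)
key_states = torch.randn(1, 8, 4, 64)       # new keys for 4 positions
cache_position = torch.arange(10, 14)       # where to write them

try:
    k_cache.index_copy_(2, cache_position, key_states)   # explicit in-place update
except NotImplementedError:
    k_cache[:, :, cache_position] = key_states           # fallback path
```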
commit 034b47784765e37ecc20f7ad43640f1a2c0094fd Author: Amit Garg Date: Tue Jul 23 03:33:22 2024 -0700 Rename Phi-3 rope scaling type (#31436) * renamed phi3 rope_scaling type * fixed trailing whitespaces * fixed test * added warning * fixed format commit bab32d6fe932a3372fbd6d5a84e3cacb12a61ae0 Author: Alexandre TL Date: Tue Jul 23 12:32:19 2024 +0200 Added mamba.py backend (#30139) * Update README.md * tests: forward ok * backward test done * done testing * removed check. scripts * Update README.md * added use_mambapy arg * fixed typo in warning * protected imports w/ mambapy package * delete pscan.py + raise rather than assert * Update import_utils.py * fix whitespaces and unused import * trailing whitespace + import block unformatted * Update modeling_mamba.py * transpose before pscan * shape comment * ran make style * use_mambapy=False by default Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * ran make fix-copies --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> commit 9ced33ca7f909d9ace743dac083daba99c904d46 Author: Merve Noyan Date: Tue Jul 23 13:23:23 2024 +0300 Fix video batching to videollava (#32139) --------- Co-authored-by: Merve Noyan commit a5b226ce9811aa6b31af0bc9c09c54493a4e67c1 Author: Cyril Vallez Date: Tue Jul 23 12:21:23 2024 +0200 Fix flash attention speed issue (#32028) Add the lru_cache for speed commit a1844a3209eb7e75582684809203bc189931a90c Author: Ita Zaporozhets <31893021+itazap@users.noreply.github.com> Date: Tue Jul 23 11:45:54 2024 +0200 gguf conversion add_prefix_space=None for llama3 (#31937) * gguf conversion forces add_prefix_space=False for llama3, this is not required and forces from_slow, which fails. changing to None + test * typo * clean test commit 2e113422b3504fe6de821bb9911b24273b11aa9c Author: Joao Gante Date: Tue Jul 23 10:42:55 2024 +0100 Llama: RoPE refactor (#32135) Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> commit 5a4a76edb7ac6bbc764392e89adc11adda91f3e5 Author: bayllama <142558246+bayllama@users.noreply.github.com> Date: Tue Jul 23 02:28:44 2024 -0700 Modify resize_token_embeddings to ensure output type is same as input (#31979) * Change resize_token_embeddings to make it return same Class that is passed to it * Add explanatory comment as requested in review * Add explanatory comments for add resizing function in lxmert * Add comment for padding_idx and moving _resize_bias in lxmert to LxmertForPreTraining --------- Co-authored-by: Prashanth Sateesh Co-authored-by: Prashanth Sateesh commit 1535a2c93d325e529dc9a1907f99247fdf8a58e7 Author: Daniel Lok Date: Tue Jul 23 17:26:00 2024 +0800 Disable quick init for TapasPreTrainedModel (#32149) add attribute to model Signed-off-by: Daniel Lok commit 34b43211d782c00da6fef778dbfaff69bbf3f115 Author: mig-mfreitas <132093787+mig-mfreitas@users.noreply.github.com> Date: Tue Jul 23 10:07:58 2024 +0100 Add YaRN and Dynamic-YaRN RoPE Scaling Methods (#30910) * Add YaRN and Dynamic-YaRN RoPE Scaling Methods YaRN (Yet another RoPE extension method) combines the NTK-By-Parts Interpolation and Attention Scaling methods, improving upon existing RoPE interpolation methods for longer context window sizes. Fine-tuned models maintain their original performance across benchmarks while enabling efficient extrapolation and transfer learning for quicker convergence, especially in compute-limited environments. 
We implement YaRN and Dynamic-YaRN for the following list of models: - LLaMA - Falcon - GPT-NeoX - Olmo - Persimmon - Phi - StableLM - OpenLLaMA New unit tests are added to assert YaRN's correct behavior on both short and long sequence inputs. For more details, please refer to https://arxiv.org/abs/2309.00071. Co-authored-by: Miguel Almeida * Refactor YaRN implementation for LLaMA Iterate on YaRN implementation for LLaMA and remove diff from remaining models for increased PR modularity. This commit includes the following changes: - Merge 'yarn_rope_scaling' and 'rope_scaling' dictionaries - Remove unnecessary attributes ('extrapolation_factor' and 'finetuned') from YaRN classes - Inherit 'forward' method in YaRN classes from superclass - Rename 'yarn' method to 'compute_yarn_scaling' - Extend YaRN tests with further assertions - Fix style inconsistencies Co-authored-by: Miguel Monte e Freitas * Refactor Tensor Building Logic for YaRN - Comply with the the tensor building logic introduced in #30743 - Add referencing to the optimized Attention Factor equation - Remove Dynamic YaRN for a more agile deployment Co-authored-by: mig-mfreitas * remove unwanted file --------- Co-authored-by: Miguel Almeida Co-authored-by: mig-mfreitas Co-authored-by: Joao Gante commit 7405c1c77e4637768ea0ad5d27d8a4d8d67bfb19 Author: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com> Date: Tue Jul 23 10:56:21 2024 +0200 Add method to retrieve used chat template (#32032) encapsulate chat template logic commit 605f3245dcca34381c35520c35ba0b701ed80d58 Author: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> Date: Tue Jul 23 10:11:12 2024 +0200 Fix mask creations of `GPTNeoX` and `GPT2` (#31944) * fix mask creation of gpt2 and gpt_neox caused by me * forgot the reshape of masks when shape > 2 * add tests for gpt neox and gpt2 * nit on a comment commit 2782aadae2b0b0c313eac3ee70f84f0335577635 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Tue Jul 23 14:55:16 2024 +0800 [modelling] remove un-necessary transpose for fa2 attention (#31749) * [whisper] remove un-necessary transpose for fa2 attention * propagate commit f83c6f1d02fba5e5ced9357b9c9196c76d937af3 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Tue Jul 23 14:54:38 2024 +0800 Remove `trust_remote_code` when loading Libri Dummy (#31748) * [whisper integration] use parquet dataset for testing * propagate to others * more propagation * last one commit 3aefb4ec7f957f9561a410eabc6f9d57b2f0384f Author: Raushan Turganbay Date: Tue Jul 23 10:23:55 2024 +0500 LLaVaNeXT: pad on right if training (#32134) * pad on right if training * docs * add tests commit 251a2409c694c29ee28e66c954670c483cf54961 Author: James Thewlis Date: Tue Jul 23 01:12:16 2024 -0400 Add llama3-llava-next-8b to llava_next conversion script (#31395) * Add llama3-llava-next-8b to llava_next conversion script Adds support for the lmms-lab/llama3-llava-next-8b model to the convert_llava_next_weights_to_hf.py script, along with an example prompt generated from the llava_llama_3 conv_template in the LLaVA-NeXT repo. * Exclude <|begin_of_text|> from prompt example This token gets added automatically, so it should not be included in the prompt example. * Add llava-next-72b and llava-next-110b Adds the Qwen-based LLaVA-Next models to the conversion script, along with changes to load the models on multiple GPUs for inference. 
* Add llama3 and qwen prompt formats to docs * Chat prompt and padding side left for llama3 batched * update * Update src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove code * better naming --------- Co-authored-by: raushan Co-authored-by: Raushan Turganbay Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 96a074fa7e2c04b904f72d9e827398d4c5f90f25 Author: Marc Sun <57196510+SunMarc@users.noreply.github.com> Date: Mon Jul 22 20:21:59 2024 +0200 Add new quant method (#32047) * Add new quant method * update * fix multi-device * add test * add offload * style * style * add simple example * initial doc * docstring * style again * works ? * better docs * switch to non persistant * remove print * fix init * code review commit bd9dca3b855b5a20ea11097b89c40f34d775f1c7 Author: Arthur <48595927+ArthurZucker@users.noreply.github.com> Date: Mon Jul 22 19:42:47 2024 +0200 set warning level to info for special tokens have been added (#32138) fixes #7002 commit 817a676bd711f9626e13578068b36ef09cf572dc Author: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Date: Mon Jul 22 18:29:50 2024 +0100 Don't default to other weights file when use_safetensors=True (#31874) * Don't default to other weights file when use_safetensors=True * Add tests * Update tests/utils/test_modeling_utils.py * Add clarifying comments to tests * Update tests/utils/test_modeling_utils.py * Update tests/utils/test_modeling_utils.py commit 74d0eb3fedf353bd670aa85ae8fcf4c85f287b5b Author: Yoni Gottesman Date: Mon Jul 22 20:24:43 2024 +0300 Return assistant generated tokens mask in apply_chat_template (#30650) return assistant generated tokens mask in apply_chat_template commit 7987710696803c74ce1b5e7f9dfa055096a6c00e Author: Bertrand Thia <56003053+bt2513@users.noreply.github.com> Date: Mon Jul 22 13:08:27 2024 -0400 [RoBERTa] Minor clarifications to model doc (#31949) * minor edits and clarifications * address comment Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> commit 12b6880c81db7742a29ea425dcb9e63b7dbdc449 Author: Sai-Suraj-27 Date: Mon Jul 22 22:16:17 2024 +0530 fix: Fixed raising `TypeError` instead of `ValueError` for invalid type (#32111) * Raised TypeError instead of ValueError for invalid types. * Updated formatting using ruff. * Retrieved few changes. * Retrieved few changes. * Updated tests accordingly. 
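For the "Add YaRN and Dynamic-YaRN RoPE Scaling Methods" entry above, a hedged sketch of selecting YaRN through the `rope_scaling` dict; the checkpoint name and scaling factor are illustrative:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hedged sketch: YaRN is picked via rope_scaling; beta_fast/beta_slow fall back
# to their defaults when omitted.
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
config.rope_scaling = {"rope_type": "yarn", "factor": 4.0}   # roughly 4x longer context
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", config=config)
```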
commit d1ec36b94f5ba45fb2423e74074cfedab48cfe73 Author: Woojun Jung <46880056+jungnerd@users.noreply.github.com> Date: Tue Jul 23 00:27:13 2024 +0900 Update `ko/_toctree.yml` and remove `custom_tools.md` to reflect latest changes (#31969) update `ko/_toctree.yml` and remove `custom_tools.md` commit 7ba028fccb82cbee792b67d596120da8ae9397c9 Author: Matt Date: Mon Jul 22 16:07:29 2024 +0100 Fix failing test with race condition (#32140) * Fix failing test with race condition * make fixup * monotonic_ns instead of randint * uuid4 instead of monotonic_ns * Add a finally cleanup step commit 5a649ff3ecd70599dd0fea7ee430ba47b51a4556 Author: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Date: Mon Jul 22 21:18:48 2024 +0800 [generate] fix eos/pad id check on mps devices (#31695) Co-authored-by: Joao Gante commit f2a1e3ca684df624016285266a0ae519e4483be7 Author: Lucain Date: Mon Jul 22 15:14:47 2024 +0200 Mention model_info.id instead of model_info.modelId (#32106) commit 0fcfc5ccc968ff5a1a439db04a94f566a0bd1d89 Author: Sai-Suraj-27 Date: Mon Jul 22 18:43:39 2024 +0530 fix: Replaced deprecated `mktemp()` function (#32123) Replaced deprecated mktemp function. commit c38c55f4fbc0163cc02ef4588fe2ec391171a2f0 Author: Joao Gante Date: Mon Jul 22 14:06:49 2024 +0100 Generate: store special token tensors under a unique variable name (#31980) * rename stuff * english; this one shouldn't be changed * add a _ to the new var names * musicgen * derp commit aa8f86a421e23fe41b6333efc11ea4248e098d83 Author: Brian <23239305+b-chu@users.noreply.github.com> Date: Mon Jul 22 08:06:22 2024 -0400 Fix shard order (#32023) commit b3818805978b411713725a1b7470dc1bda073c29 Author: Aymeric Roucher <69208727+aymeric-roucher@users.noreply.github.com> Date: Mon Jul 22 10:49:57 2024 +0200 Agents planning (#31702) * Allow planning for agents commit 0fdea8607d7e01eb0e38a1ebeb7feee30a22f0cf Author: Lucain Date: Fri Jul 19 20:32:39 2024 +0200 Fix tests after `huggingface_hub` 0.24 (#32054) * adapt tests * style * comment commit fe008d6ebea1f5770b740991daeefd9322fa434a Author: Raushan Turganbay Date: Fri Jul 19 19:21:45 2024 +0500 Chameleon: not supported with fast load (#32091) fixes commit 62aa270f2ab3acca2a58cde8f08400ec49330b03 Author: Zach Mueller Date: Fri Jul 19 08:58:53 2024 -0400 Disable quick init for deepspeed (#32066) Disable via deepspeed commit 89575b567e061fd87bdd655ba188b6c7a922d54a Author: Kamil Akesbi <45195979+kamilakesbi@users.noreply.github.com> Date: Fri Jul 19 13:42:22 2024 +0100 Support generating with fallback for short form audio in Whisper (#30984) * remove is_shortform * adapt _retrieve_max_frames_and_seek for short_form * return bos token in short and long form * add decoder_input_ids to short form audios * add eos token for short form * handle short form token_timestamps * no need to return scores * add is_shortform conditions * handle when max_new_tokens is None - short form * handle assistant decoding * fix * handle return_dict_in_generate * handle split_by_batch for encoder_attentions attribute * handle num_beams>1 * handle num_return_sequences>1 in generate_with_fallback * handle num_return_sequences>1 with return_dict_in_generate=True * raise error if max_new_tokens + decoder_inputs_ids > max_target_pos * fix * apply review suggestions * fix * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi 
<93869735+sanchit-gandhi@users.noreply.github.com> * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * fix * logits for both short form and long form * handle if logits_processor is None * test * apply review changes to num_return_sequences * add _expand_variables_for_generation * remove short form commented section * update comments * uncomment num_beams line in generate_with_fallback * update assistant decoding * handle return_segment with short form generation * up * fix output format is_shortform * overwrite beam_sample test * update _set_return_timestamps * apply review suggestions * apply review suggestions * remove seek_outputs_short_form * fix _stack_split_outputs * fix stack dim in _stack_split_outputs * update tests * fix past_key_values + beam tests * fix * clean _expand_variables_for_generation * make style * fix slow tests * make style * max_length condition * make style * add slow tests for shortform fallback * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * apply review changes * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * up * fix slow tests * apply review suggestions * update test * make style * small fix * fix * fix test_new_cache_format * fix past_key_values * fix * make style * fix slow tests * fix --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> commit 46835ec6aed62e9a73784f1b6a43030afd601e5e Author: Merve Noyan Date: Fri Jul 19 15:40:40 2024 +0300 Add image-text-to-text task guide (#31777) * Add image-text-to-text task page * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Address comments * Fix heading * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: 
amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/tasks/image_text_to_text.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Address comments * Update image_text_to_text.md --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 4bd8f12972c6ad06e264baa39f17ec9dfa9a5cb2 Author: Merve Noyan Date: Fri Jul 19 14:50:34 2024 +0300 Fixes to chameleon docs (#32078) * Fixes * Let's not use auto commit 566b0f1fbf5feb53a18591ca215a8d1245a790ef Author: Keith Stevens Date: Fri Jul 19 03:56:45 2024 -0700 Fix progress callback deepcopy (#32070) * Replacing ProgressCallbacks deepcopy with a shallowcopy * Using items instead of entries * code cleanup for copy in trainer callback * Style fix for ProgressCallback commit e316c5214fe51de0bf8e824245bfd6225c9925aa Author: Raushan Turganbay Date: Fri Jul 19 15:38:01 2024 +0500 VideoLLaVa: fix chat format in docs (#32083) fix chat format commit 22f888b3fab3d914882b8f44896a5658712f535c Author: Joshua Lochner Date: Fri Jul 19 11:19:35 2024 +0200 [mistral] Fix FA2 attention reshape for Mistral Nemo (#32065) * [mistral] Fix FA2 attention reshape * [run-slow] mistral commit cd48553fc8375e1a28d4d82cfe231dedf6a23af8 Author: Kamil Akesbi <45195979+kamilakesbi@users.noreply.github.com> Date: Fri Jul 19 09:26:38 2024 +0100 Incorrect Whisper long-form decoding timestamps (#32003) * fix lo form timestamps in decode_batch * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> * add test * make style * fix copies * Update src/transformers/models/whisper/tokenization_whisper_fast.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/whisper/processing_whisper.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/whisper/tokenization_whisper.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * apply review suggestions * fix * fix copies * fix * Update src/transformers/models/whisper/tokenization_whisper_fast.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix-copies --------- Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> commit 56a7745704261919dd8117e3a8aa4fb43fade30e Author: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Date: Fri Jul 19 10:20:03 2024 +0200 [Chameleon, Hiera] Improve docs (#32038) * Improve docs * Fix docs * Fix code snippet commit b873234cb649a24865021f0d598627ce2b24d34a Author: Raushan Turganbay Date: Fri Jul 19 10:08:56 2024 +0500 Llava: add default chat templates 
(#31691) * add default chat templates * Update src/transformers/models/llava/processing_llava.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/llava_next/processing_llava_next.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * more clear docstring and docs * Update docs/source/en/model_doc/llava.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/model_doc/llava_next.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * add tests * remove default templates (see #31733) * load chat template from another file * Update docs/source/en/model_doc/llava_next.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * revert some changes in docs * forgot vipllava * chat template file is not temporary hack * warn if loading from processor * not that file * similarly modify `save_pretrained` * Update tests/models/llava_next/test_processor_llava_next.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/vipllava/test_processor_vipllava.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/processing_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/processing_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/vipllava.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/llava.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/llava.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/llava_next.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/llava_next.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/processing_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/llava_next.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> commit 271fd8e60d26b1773321c8b01fb67ed831fd3494 Author: Sai-Suraj-27 Date: Fri Jul 19 01:58:36 2024 +0530 docs: Fixed 2 links in the docs along with some minor fixes (#32058) * Fixed 2 links in the docs along with some minor fixes. * Updated Contributing.md commit 8f0d26c55e5350be3aefdc94a106a24b147204bc Author: Sai-Suraj-27 Date: Thu Jul 18 21:56:08 2024 +0530 fix: Removed `duplicate entries` in a dictionary (#32041) Removed duplicate key in a dictionary. 
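For the "Llava: add default chat templates" entry above, a hedged sketch of building a prompt from the processor's chat template instead of a hand-written "USER: <image> ... ASSISTANT:" string; the checkpoint is illustrative:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
# Renders the model-specific prompt, including the image placeholder token.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```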
commit c75969ee28eb293388bd587669e414c5a4f8d79f Author: Longjie Zheng <32992656+zhenglongjiepheonix@users.noreply.github.com> Date: Thu Jul 18 11:54:54 2024 -0400 Add torch.compile Support For Mamba (#31247) * modify mamba cache * set up cache * add test * [run-slow] mamba * [run-slow] mamba * address comments * [run-slow] mamba * use_cache_position * [run-slow] mamba * [run-slow] mamba * [run-slow] mamba * [run-slow] mamba * fix * cache in generate * [run-slow] mamba * address comments * [run-slow] mamba * [run-slow] mamba * address comments * [run-slow] mamba * fix * [run-slow] mamba * fix * [run-slow] mamba * fix cache name * [run-slow] mamba commit 4c040aba02b0283619a06bdc40ecf868508b9e52 Author: Joshua Lochner Date: Thu Jul 18 16:41:12 2024 +0200 [mistral] Support passing `head_dim` through config (and do not require `head_dim * num_heads == hidden_size`) (#32050) * Allow `head_dim` to be set in Mistral config * Add docstring * Do not require `head_dim * num_heads == hidden_size` * [run-slow] mistral commit c50e0551fd5652e69360d8451a36be4f10bf04dc Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Jul 18 13:29:56 2024 +0100 Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples (#32052) Bump scikit-learn in /examples/research_projects/codeparrot/examples Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.1.2 to 1.5.0. - [Release notes](https://github.com/scikit-learn/scikit-learn/releases) - [Commits](https://github.com/scikit-learn/scikit-learn/compare/1.1.2...1.5.0) --- updated-dependencies: - dependency-name: scikit-learn dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit c25dde1fc97f47069d2e5e84e5a9a2ba4569b372 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Jul 18 13:13:38 2024 +0100 Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer (#31458) Bump scikit-learn in /examples/research_projects/decision_transformer Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.0.2 to 1.5.0. - [Release notes](https://github.com/scikit-learn/scikit-learn/releases) - [Commits](https://github.com/scikit-learn/scikit-learn/compare/1.0.2...1.5.0) --- updated-dependencies: - dependency-name: scikit-learn dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 673d30b8268b584f9c600be27e048de0c714c0af Author: Raushan Turganbay Date: Thu Jul 18 16:54:07 2024 +0500 Chameleon: minor fixes after shipping (#32037) * fix merging * make chameleon conditional commit 765732e92c17a17682a5efd4aa4323e5ab01fe07 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Thu Jul 18 11:26:01 2024 +0200 unpin `numpy<2.0` (#32018) * unpin np * [test_all] trigger full CI --------- Co-authored-by: ydshieh commit 1c37e8c1a6274e6e87b45c6319eb190757214c2a Author: Pavel Iakubovskii Date: Thu Jul 18 06:00:37 2024 +0100 Add `sdpa` and FA2 for CLIP (#31940) * Squashed commit of the following: commit 102842cd477219b9f9bcb23a0bca3a8b92bd732f Author: Pavel Iakubovskii Date: Fri Jul 12 18:23:52 2024 +0000 Add model-specific sdpa tests commit 60e4c88581abf89ec098da84ed8e92aa904c997d Author: Pavel Iakubovskii Date: Fri Jul 12 18:20:53 2024 +0000 Add fallback to eager (expensive operation) commit c29033d30e7ffde4327e8a15cbbc6bee37546f80 Author: Pavel Iakubovskii Date: Thu Jul 11 17:09:55 2024 +0000 Fix attn_implementation propagation commit 783aed05f0f38cb2f99e758f81db6838ac55b9f8 Author: sayakpaul Date: Sat May 25 09:05:27 2024 +0530 style commit e77e703ca75d00447cda277eca6b886cd32bddc0 Author: sayakpaul Date: Sat May 25 09:04:57 2024 +0530 add comment to explain why I had to touch forbidden codebase. commit ab9d8849758e7773a31778ccba71588d18552623 Author: sayakpaul Date: Sat May 25 09:03:02 2024 +0530 fix: flax attribute access. commit c570fc0abf9d1bd58c291aae3c7e384f995996d2 Author: sayakpaul Date: Sat May 25 08:23:54 2024 +0530 fix tensorflow attribute name. commit 32c812871cfdb268d8a6e3e2c61c5c925c8ed47e Author: sayakpaul Date: Sat May 25 07:57:10 2024 +0530 fix attribute access. commit 4f41a0138b6c417aed9c9332278f8bcd979cb7c2 Author: sayakpaul Date: Sat May 25 07:44:02 2024 +0530 _from_config. commit 35aed64ff602422adcf41d7f677a0a24bd9eccae Author: sayakpaul Date: Fri May 24 18:46:52 2024 +0530 propagation of attn_implementation. commit 4c25c19845438b1dc1d35a5adf9436151c8c5940 Author: sayakpaul Date: Fri May 24 09:24:36 2024 +0530 style again commit 5f7dc5c5015c0f8116408f737e8c318d1802c80c Author: sayakpaul Date: Fri May 24 09:19:05 2024 +0530 use from_config. commit b70c409956d0359fa6ae5372275d2a20ba7e3389 Author: sayakpaul Date: Fri May 24 09:13:43 2024 +0530 quality commit a7b63beff53d0fc754c6564e2a7b51731ddee49d Author: sayakpaul Date: Fri May 10 14:35:10 2024 +0200 add benchmark numbers commit 455b0eaea50862b8458c8f422b60fe60ae40fdcb Author: sayakpaul Date: Fri May 10 13:50:16 2024 +0200 Revert "reflect feedback more" This reverts commit dc123e71eff60aae74d5f325f113d515d0d71117. commit ca674829d28787349c2a9593a14e0f1d41f04ea4 Author: sayakpaul Date: Fri May 10 13:50:05 2024 +0200 Revert "fix" This reverts commit 37a1cb35b87acdc4cf7528b8b1ed6da27d244e52. 
commit fab2dd8576c099eb1a3464958cb206a664d28247 Author: sayakpaul Date: Fri May 10 13:47:46 2024 +0200 fix commit fbc6ae50fd6f2d36294d31e191761631b701d696 Author: sayakpaul Date: Fri May 10 13:38:30 2024 +0200 reflect feedback more commit 87245bb020b2d60a89afe318a951df0159404fc9 Author: sayakpaul Date: Fri May 3 08:54:34 2024 +0530 fixes commit 1057cc26390ee839251e7f8b3326c4207595fb23 Author: sayakpaul Date: Fri May 3 07:49:03 2024 +0530 don't explicit set attn_implementation in tests commit e33f75916fc8a99f516b1cf449dbbe9d3aabda81 Author: sayakpaul Date: Fri May 3 07:43:54 2024 +0530 explicitly override attn_implementation in the towers. commit 4cf41cb1bc885c39df7cb8f2a0694ebf23299235 Author: sayakpaul Date: Fri May 3 07:38:42 2024 +0530 import in one-line. commit f2cc447ae9e74ccfacb448140cdf88259d4afc8c Author: sayakpaul Date: Fri May 3 07:34:58 2024 +0530 move sdpa mention to usage tips. commit 92884766c64dbb456926a3a84dd427be1349fa95 Author: sayakpaul Date: Mon Apr 29 10:58:26 2024 +0530 fix: memory allocation problem. commit d7ffbbfe12f7750b7d0a361420f35c13e0ea787d Author: sayakpaul Date: Mon Apr 29 09:56:59 2024 +0530 fix-copies commit 8dfc3731cedd02e36acd3fe56bb2e6d61efd25d8 Author: sayakpaul Date: Fri Apr 26 20:16:12 2024 +0530 address arthur's comments. commit d2ed7b4ce4ff15ae9aa4d3d0500f1544e3dcd9e9 Author: Sayak Paul Date: Fri Apr 26 20:08:15 2024 +0530 Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> commit 46e04361f37ded5c522ff05e9f725b9f82dce40e Author: sayakpaul Date: Wed Apr 24 09:55:27 2024 +0530 add to docs. commit 831629158ad40d34d8983f209afb2740ba041af2 Author: sayakpaul Date: Wed Apr 24 09:33:10 2024 +0530 styling.g commit d263a119c77314250f4b4c8469caf42559197f22 Author: sayakpaul Date: Wed Apr 24 09:15:20 2024 +0530 up commit d44f9d3d7633d4c241a737a1bc317f791f6aedb3 Author: sayakpaul Date: Tue Apr 23 18:40:42 2024 +0530 handle causal and attention mask commit 122f1d60153df6666b634a94e38d073f3f260926 Author: sayakpaul Date: Tue Apr 23 15:18:21 2024 +0530 test fixes. commit 4382d8cff6fa1dee5dbcf0d06b3e2841231e36f5 Author: sayakpaul Date: Tue Apr 23 09:39:25 2024 +0530 fix: scaling inside sdpa. commit 0f629989efc48b7315cf19405a81e02955efe7e5 Author: Sayak Paul Date: Tue Apr 23 08:14:58 2024 +0530 Update src/transformers/models/clip/modeling_clip.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> commit 14367316877dc27ea40f767ad1aee38bbc97e4ce Author: sayakpaul Date: Mon Apr 22 16:21:36 2024 +0530 add: sdpa support to clip. 
* Remove fallback for empty attention mask (expensive operation) * Fix typing in copies * Add flash attention * Add flash attention tests * List CLIP in FA docs * Fix embeddings attributes and tf * [run-slow] clip * Update clip documentation * Remove commented code, skip compile dynamic for CLIPModel * Fix doc * Fix doc 2 * Remove double transpose * Add torch version check for contiguous() * Add comment to test mixin * Fix copies * Add comment for mask * Update docs * [run-slow] clip commit b31d59504003c8140adf66a4077b1c50799fbe89 Author: Robin Bakker Date: Wed Jul 17 22:32:53 2024 +0200 Add language to word timestamps for Whisper (#31572) * add language to words _collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information * ran style checks added missing comma * add new language test test that the pipeline can return both the language and timestamp * remove model configuration in test Removed model configurations that do not influence test results * remove model configuration in test Removed model configurations that do not influence test results commit cb23d1b20b4450195564f107ac873e5ecd616def Author: Francesco Cariaggi Date: Wed Jul 17 21:42:53 2024 +0200 Pass missing arguments to `SeamlessM4Tv2ConformerEncoderLayer.forward()` when gradient checkpointing is enabled (#31945) * pass missing arguments when gradient checkpointing is enabled for SeamlessM4Tv2 * fix same bug in SeamlessM4Tv1 * pass args, not kwargs commit bc36c26fa62fce8b89b13d42e2dbdfafeeb102ca Author: Dmitry Rogozhkin Date: Wed Jul 17 12:24:10 2024 -0700 doc: fix broken BEiT and DiNAT model links on Backbone page (#32029) Signed-off-by: Dmitry Rogozhkin commit 63be8e6f3963d424793e52b85d09e53f509e8618 Author: Moses Hohman Date: Wed Jul 17 14:20:39 2024 -0500 Fix typo in classification function selection logic to improve code consistency (#32031) Make problem_type condition consistent with num_labels condition The latter condition generally overrides the former, so this is more of a code reading issue. I'm not sure the bug would ever actually get triggered under normal use. 
--- .circleci/config.yml | 3 +- .github/workflows/self-pr-slow-ci.yml | 2 +- .github/workflows/trufflehog.yml | 23 +- CONTRIBUTING.md | 6 +- Makefile | 1 + docker/consistency.dockerfile | 2 +- docker/transformers-all-latest-gpu/Dockerfile | 2 +- docker/transformers-pytorch-gpu/Dockerfile | 2 +- docs/source/en/_toctree.yml | 10 +- docs/source/en/agents.md | 51 ++ docs/source/en/chat_templating.md | 19 +- docs/source/en/conversations.md | 2 +- docs/source/en/generation_strategies.md | 37 -- docs/source/en/internal/generation_utils.md | 18 + docs/source/en/kv_cache.md | 346 +++++++++++ docs/source/en/llm_optims.md | 139 +++-- docs/source/en/main_classes/agent.md | 4 + docs/source/en/main_classes/backbones.md | 6 +- docs/source/en/main_classes/data_collator.md | 5 + docs/source/en/main_classes/quantization.md | 5 + docs/source/en/model_doc/chameleon.md | 45 +- docs/source/en/model_doc/clip.md | 117 ++++ docs/source/en/model_doc/dinov2.md | 2 +- docs/source/en/model_doc/grounding-dino.md | 61 +- docs/source/en/model_doc/hiera.md | 14 + docs/source/en/model_doc/llava-next-video.md | 7 + docs/source/en/model_doc/llava.md | 38 +- docs/source/en/model_doc/llava_next.md | 114 +++- docs/source/en/model_doc/marian.md | 2 +- docs/source/en/model_doc/qwen2.md | 10 +- docs/source/en/model_doc/roberta.md | 24 +- docs/source/en/model_doc/video_llava.md | 6 +- docs/source/en/model_doc/vipllava.md | 48 +- docs/source/en/model_doc/whisper.md | 5 +- docs/source/en/model_sharing.md | 2 +- docs/source/en/perf_infer_gpu_one.md | 2 + docs/source/en/perf_torch_compile.md | 2 +- docs/source/en/quantization/fbgemm_fp8.md | 58 ++ docs/source/en/quantization/overview.md | 1 + docs/source/en/tasks/image_text_to_text.md | 232 ++++++++ docs/source/en/tf_xla.md | 4 +- docs/source/en/trainer.md | 6 +- docs/source/es/chat_templating.md | 8 +- docs/source/ja/chat_templating.md | 2 +- docs/source/ko/_toctree.yml | 34 +- docs/source/ko/custom_tools.md | 22 - docs/source/zh/_toctree.yml | 2 + docs/source/zh/chat_templating.md | 2 +- docs/source/zh/philosophy.md | 67 +++ .../language-modeling/t5_tokenizer_model.py | 4 +- examples/flax/question-answering/run_qa.py | 2 +- .../run_flax_speech_recognition_seq2seq.py | 2 +- .../flax/text-classification/run_flax_glue.py | 2 +- .../flax/token-classification/run_flax_ner.py | 2 +- .../run_audio_classification.py | 2 +- .../contrastive-image-text/run_clip.py | 2 +- .../run_image_classification.py | 2 +- .../run_image_classification_no_trainer.py | 2 +- examples/pytorch/image-pretraining/run_mae.py | 2 +- examples/pytorch/image-pretraining/run_mim.py | 2 +- .../image-pretraining/run_mim_no_trainer.py | 2 +- .../run_instance_segmentation.py | 2 +- .../run_instance_segmentation_no_trainer.py | 2 +- examples/pytorch/language-modeling/run_clm.py | 2 +- .../language-modeling/run_clm_no_trainer.py | 2 +- examples/pytorch/language-modeling/run_fim.py | 2 +- .../language-modeling/run_fim_no_trainer.py | 2 +- examples/pytorch/language-modeling/run_mlm.py | 2 +- .../language-modeling/run_mlm_no_trainer.py | 2 +- examples/pytorch/language-modeling/run_plm.py | 2 +- examples/pytorch/multiple-choice/run_swag.py | 2 +- .../multiple-choice/run_swag_no_trainer.py | 2 +- .../object-detection/run_object_detection.py | 2 +- .../run_object_detection_no_trainer.py | 2 +- examples/pytorch/question-answering/run_qa.py | 2 +- .../question-answering/run_qa_beam_search.py | 2 +- .../run_qa_beam_search_no_trainer.py | 2 +- .../question-answering/run_qa_no_trainer.py | 2 +- .../question-answering/run_seq2seq_qa.py | 
2 +- .../run_semantic_segmentation.py | 2 +- .../run_semantic_segmentation_no_trainer.py | 2 +- .../run_speech_recognition_ctc.py | 2 +- .../run_speech_recognition_ctc_adapter.py | 2 +- .../run_speech_recognition_seq2seq.py | 2 +- .../summarization/run_summarization.py | 2 +- .../run_summarization_no_trainer.py | 2 +- .../text-classification/run_classification.py | 2 +- .../pytorch/text-classification/run_glue.py | 2 +- .../run_glue_no_trainer.py | 2 +- .../pytorch/text-classification/run_xnli.py | 2 +- .../pytorch/token-classification/run_ner.py | 2 +- .../run_ner_no_trainer.py | 2 +- .../pytorch/translation/run_translation.py | 2 +- .../translation/run_translation_no_trainer.py | 2 +- .../bertabs/modeling_bertabs.py | 10 +- .../codeparrot/examples/requirements.txt | 2 +- .../decision_transformer/requirements.txt | 4 +- .../distillation/grouped_batch_sampler.py | 2 +- .../fsner/src/fsner/tokenizer_utils.py | 2 +- .../lxmert/modeling_frcnn.py | 8 +- .../emmental/modules/binarizer.py | 2 +- .../movement-pruning/masked_run_glue.py | 2 +- .../movement-pruning/masked_run_squad.py | 2 +- .../modeling_flax_performer_utils.py | 4 +- .../_test_seq2seq_examples.py | 4 +- .../research_projects/tapex/wikisql_utils.py | 4 +- .../visual_bert/modeling_frcnn.py | 8 +- .../contrastive-image-text/run_clip.py | 2 +- .../run_image_classification.py | 2 +- .../tensorflow/multiple-choice/run_swag.py | 2 +- .../tensorflow/question-answering/run_qa.py | 2 +- .../summarization/run_summarization.py | 2 +- .../text-classification/run_glue.py | 2 +- .../tensorflow/translation/run_translation.py | 2 +- scripts/benchmark/trainer-benchmark.py | 2 +- setup.py | 6 +- src/transformers/__init__.py | 22 +- src/transformers/agents/__init__.py | 2 + src/transformers/agents/agent_types.py | 2 +- src/transformers/agents/agents.py | 283 ++++++--- src/transformers/agents/default_tools.py | 2 +- src/transformers/agents/evaluate_agent.py | 2 +- src/transformers/agents/monitoring.py | 75 +++ src/transformers/agents/prompts.py | 418 ++++++++++++- src/transformers/agents/python_interpreter.py | 489 +++++++++------- src/transformers/audio_utils.py | 2 +- src/transformers/cache_utils.py | 445 +++++++++++++- src/transformers/commands/pt_to_tf.py | 4 +- src/transformers/configuration_utils.py | 2 +- src/transformers/convert_slow_tokenizer.py | 234 +++----- src/transformers/data/__init__.py | 1 + src/transformers/data/data_collator.py | 37 +- src/transformers/data/processors/xnli.py | 12 +- src/transformers/dependency_versions_table.py | 4 +- .../generation/beam_constraints.py | 10 +- .../generation/candidate_generator.py | 34 +- src/transformers/generation/logits_process.py | 6 +- .../generation/stopping_criteria.py | 5 +- src/transformers/generation/utils.py | 343 ++++++----- src/transformers/hf_argparser.py | 2 +- src/transformers/image_processing_base.py | 2 +- src/transformers/image_transforms.py | 8 +- src/transformers/image_utils.py | 3 +- src/transformers/integrations/__init__.py | 2 + src/transformers/integrations/awq.py | 2 +- src/transformers/integrations/fbgemm_fp8.py | 161 +++++ src/transformers/integrations/ggml.py | 49 +- .../integrations/integration_utils.py | 111 ++-- src/transformers/integrations/peft.py | 4 +- src/transformers/modeling_attn_mask_utils.py | 2 +- .../modeling_flash_attention_utils.py | 89 ++- src/transformers/modeling_flax_utils.py | 2 +- .../modeling_gguf_pytorch_utils.py | 4 +- src/transformers/modeling_rope_utils.py | 554 ++++++++++++++++++ src/transformers/modeling_tf_utils.py | 6 +- 
src/transformers/modeling_utils.py | 100 +++- .../models/align/modeling_align.py | 4 +- .../models/altclip/modeling_altclip.py | 9 +- src/transformers/models/auto/modeling_auto.py | 2 +- .../autoformer/configuration_autoformer.py | 4 +- .../models/bark/processing_bark.py | 2 +- src/transformers/models/bart/modeling_bart.py | 3 +- .../models/bart/modeling_flax_bart.py | 2 +- src/transformers/models/bert/modeling_bert.py | 2 +- .../models/bert/modeling_tf_bert.py | 2 +- .../models/bert/tokenization_bert.py | 4 +- .../tokenization_bert_japanese.py | 6 +- .../models/big_bird/modeling_big_bird.py | 2 +- .../modeling_bigbird_pegasus.py | 1 + src/transformers/models/bit/modeling_bit.py | 4 +- .../blenderbot/tokenization_blenderbot.py | 14 - .../tokenization_blenderbot_fast.py | 15 - .../tokenization_blenderbot_small.py | 15 - .../tokenization_blenderbot_small_fast.py | 15 - .../convert_blip_original_pytorch_to_hf.py | 2 +- src/transformers/models/blip/modeling_blip.py | 4 +- .../models/blip/modeling_tf_blip.py | 4 +- .../models/bloom/modeling_bloom.py | 448 ++++++++------ .../models/bloom/tokenization_bloom_fast.py | 8 - .../image_processing_bridgetower.py | 2 +- .../models/camembert/modeling_camembert.py | 2 +- src/transformers/models/chameleon/__init__.py | 4 +- .../convert_chameleon_weights_to_hf.py | 2 +- .../models/chameleon/modeling_chameleon.py | 118 +++- .../models/chameleon/processing_chameleon.py | 2 +- .../chinese_clip/modeling_chinese_clip.py | 4 +- src/transformers/models/clap/modeling_clap.py | 4 +- src/transformers/models/clip/modeling_clip.py | 208 ++++++- .../models/clip/modeling_tf_clip.py | 4 +- .../models/clip/tokenization_clip.py | 2 +- .../models/clipseg/modeling_clipseg.py | 17 +- .../models/clvp/feature_extraction_clvp.py | 2 +- src/transformers/models/clvp/modeling_clvp.py | 13 +- .../code_llama/tokenization_code_llama.py | 55 -- .../tokenization_code_llama_fast.py | 55 -- .../models/cohere/modeling_cohere.py | 127 +++- .../models/cohere/tokenization_cohere_fast.py | 182 ------ .../modeling_conditional_detr.py | 2 +- .../models/convbert/tokenization_convbert.py | 4 +- .../models/convnext/configuration_convnext.py | 4 +- .../convnextv2/configuration_convnextv2.py | 4 +- .../convnextv2/modeling_tf_convnextv2.py | 2 +- .../models/cpmant/tokenization_cpmant.py | 2 +- .../models/data2vec/modeling_data2vec_text.py | 2 +- .../models/dbrx/configuration_dbrx.py | 13 +- src/transformers/models/dbrx/modeling_dbrx.py | 133 ++++- .../models/deberta/configuration_deberta.py | 2 +- .../models/deberta/modeling_deberta.py | 20 +- .../models/deberta/modeling_tf_deberta.py | 4 +- .../deberta_v2/configuration_deberta_v2.py | 2 +- .../models/deberta_v2/modeling_deberta_v2.py | 20 +- .../deberta_v2/modeling_tf_deberta_v2.py | 4 +- .../deberta_v2/tokenization_deberta_v2.py | 2 +- .../modeling_deformable_detr.py | 2 +- .../models/deprecated/deta/modeling_deta.py | 2 +- .../tokenization_gptsan_japanese.py | 15 +- .../models/deprecated/mega/modeling_mega.py | 3 + .../deprecated/mmbt/configuration_mmbt.py | 2 +- .../open_llama/modeling_open_llama.py | 3 + .../deprecated/realm/tokenization_realm.py | 4 +- .../retribert/tokenization_retribert.py | 4 +- .../modeling_speech_to_text_2.py | 2 +- .../transfo_xl/tokenization_transfo_xl.py | 6 +- .../depth_anything/modeling_depth_anything.py | 2 +- src/transformers/models/detr/modeling_detr.py | 2 +- .../models/distilbert/modeling_distilbert.py | 2 +- .../distilbert/modeling_flax_distilbert.py | 2 +- .../distilbert/tokenization_distilbert.py | 4 +- 
.../models/dpt/convert_dinov2_depth_to_hf.py | 2 +- .../dpt/convert_dpt_hybrid_to_pytorch.py | 2 +- src/transformers/models/dpt/modeling_dpt.py | 2 +- .../models/electra/tokenization_electra.py | 4 +- .../models/ernie/modeling_ernie.py | 2 +- src/transformers/models/esm/modeling_esm.py | 2 +- .../models/esm/modeling_tf_esm.py | 2 +- .../models/esm/openfold_utils/chunk_utils.py | 6 +- .../esm/openfold_utils/residue_constants.py | 2 +- .../models/esm/openfold_utils/rigid_utils.py | 4 +- .../models/esm/openfold_utils/tensor_utils.py | 2 +- .../models/falcon/modeling_falcon.py | 14 +- .../models/flava/configuration_flava.py | 10 +- .../models/flava/modeling_flava.py | 8 +- src/transformers/models/fnet/modeling_fnet.py | 2 +- src/transformers/models/fsmt/modeling_fsmt.py | 4 +- .../models/funnel/tokenization_funnel.py | 4 +- .../models/fuyu/configuration_fuyu.py | 1 - src/transformers/models/gemma/diff_gemma.py | 7 +- .../models/gemma/modeling_gemma.py | 124 +++- .../models/gemma/tokenization_gemma.py | 2 +- .../models/gemma2/modeling_gemma2.py | 162 +++-- src/transformers/models/git/modeling_git.py | 10 +- src/transformers/models/gpt2/modeling_gpt2.py | 26 +- .../models/gpt2/tokenization_gpt2.py | 7 - .../models/gpt2/tokenization_gpt2_fast.py | 9 - .../models/gpt_neox/configuration_gpt_neox.py | 1 - .../models/gpt_neox/modeling_gpt_neox.py | 36 +- .../gpt_neox/tokenization_gpt_neox_fast.py | 8 - .../tokenization_gpt_neox_japanese.py | 14 +- .../models/gpt_sw3/tokenization_gpt_sw3.py | 16 - src/transformers/models/gptj/modeling_gptj.py | 3 +- .../grounding_dino/modeling_grounding_dino.py | 2 +- .../models/groupvit/modeling_groupvit.py | 8 +- .../models/groupvit/modeling_tf_groupvit.py | 4 +- .../models/herbert/tokenization_herbert.py | 2 +- .../models/hubert/modeling_hubert.py | 2 +- .../models/hubert/modeling_tf_hubert.py | 4 +- .../models/ibert/modeling_ibert.py | 2 +- .../models/idefics/configuration_idefics.py | 2 +- .../models/idefics/modeling_idefics.py | 3 + src/transformers/models/idefics/vision.py | 6 +- .../models/idefics2/modeling_idefics2.py | 13 +- .../models/idefics2/processing_idefics2.py | 57 -- .../models/jamba/modeling_jamba.py | 10 +- .../models/jetmoe/modeling_jetmoe.py | 102 +++- .../models/kosmos2/modeling_kosmos2.py | 8 +- .../models/kosmos2/processing_kosmos2.py | 2 +- .../models/layoutlm/tokenization_layoutlm.py | 4 +- .../layoutlmv2/tokenization_layoutlmv2.py | 4 +- .../models/llama/configuration_llama.py | 81 +-- .../llama/convert_llama_weights_to_hf.py | 154 +++-- .../models/llama/modeling_llama.py | 319 +++++++--- .../models/llama/tokenization_llama.py | 57 +- .../models/llama/tokenization_llama_fast.py | 55 -- .../models/llava/modeling_llava.py | 1 + .../models/llava/processing_llava.py | 6 +- .../convert_llava_next_weights_to_hf.py | 125 ++-- .../llava_next/image_processing_llava_next.py | 2 +- .../models/llava_next/modeling_llava_next.py | 17 +- .../modeling_llava_next_video.py | 17 +- .../processing_llava_next_video.py | 60 -- .../models/longformer/modeling_longformer.py | 2 +- .../models/luke/tokenization_luke.py | 2 +- .../models/lxmert/modeling_lxmert.py | 16 + .../models/lxmert/tokenization_lxmert.py | 4 +- .../models/mamba/configuration_mamba.py | 4 + .../models/mamba/modeling_mamba.py | 229 +++++--- .../marian/convert_marian_to_pytorch.py | 2 +- .../markuplm/feature_extraction_markuplm.py | 2 +- .../mask2former/modeling_mask2former.py | 2 +- .../models/mbart/modeling_flax_mbart.py | 2 +- .../models/mbart/modeling_mbart.py | 3 +- 
.../megatron_bert/modeling_megatron_bert.py | 2 +- .../models/mistral/configuration_mistral.py | 4 + .../models/mistral/modeling_mistral.py | 32 +- .../models/mixtral/modeling_mixtral.py | 107 +++- .../models/mluke/tokenization_mluke.py | 2 +- .../mobilebert/tokenization_mobilebert.py | 4 +- .../models/mpnet/tokenization_mpnet.py | 4 +- src/transformers/models/mt5/modeling_mt5.py | 2 +- .../models/musicgen/modeling_musicgen.py | 137 ++--- .../modeling_musicgen_melody.py | 162 ++--- .../models/olmo/configuration_olmo.py | 1 - src/transformers/models/olmo/modeling_olmo.py | 138 ++++- .../oneformer/image_processing_oneformer.py | 2 +- .../models/openai/tokenization_openai.py | 2 +- .../models/owlv2/image_processing_owlv2.py | 17 +- .../models/owlv2/modeling_owlv2.py | 6 +- .../models/owlvit/image_processing_owlvit.py | 15 +- .../models/owlvit/modeling_owlvit.py | 6 +- .../models/paligemma/modeling_paligemma.py | 1 + .../patchtsmixer/modeling_patchtsmixer.py | 31 +- .../persimmon/configuration_persimmon.py | 1 - .../models/persimmon/modeling_persimmon.py | 125 +++- .../models/phi/configuration_phi.py | 1 - src/transformers/models/phi/modeling_phi.py | 123 +++- .../models/phi3/configuration_phi3.py | 20 +- src/transformers/models/phi3/modeling_phi3.py | 197 +++++-- .../prophetnet/tokenization_prophetnet.py | 4 +- .../models/qwen2/modeling_qwen2.py | 129 +++- .../models/qwen2_moe/modeling_qwen2_moe.py | 133 ++++- src/transformers/models/rag/modeling_rag.py | 4 +- .../models/rag/modeling_tf_rag.py | 4 +- src/transformers/models/rag/retrieval_rag.py | 2 +- .../modeling_recurrent_gemma.py | 3 + .../models/roberta/modeling_roberta.py | 2 +- .../modeling_roberta_prelayernorm.py | 2 +- .../models/roc_bert/modeling_roc_bert.py | 2 +- .../models/roc_bert/tokenization_roc_bert.py | 4 +- .../models/roformer/tokenization_roformer.py | 4 +- .../models/rt_detr/modeling_rt_detr.py | 2 +- .../feature_extraction_seamless_m4t.py | 6 +- .../seamless_m4t/modeling_seamless_m4t.py | 3 + .../seamless_m4t/tokenization_seamless_m4t.py | 3 +- .../modeling_seamless_m4t_v2.py | 3 + .../models/seggpt/modeling_seggpt.py | 4 +- .../models/sew_d/modeling_sew_d.py | 20 +- .../models/siglip/modeling_siglip.py | 6 +- .../modeling_speech_encoder_decoder.py | 2 +- .../feature_extraction_speech_to_text.py | 2 +- .../speech_to_text/modeling_speech_to_text.py | 4 +- .../modeling_tf_speech_to_text.py | 2 +- .../models/splinter/tokenization_splinter.py | 4 +- .../squeezebert/tokenization_squeezebert.py | 4 +- .../models/stablelm/configuration_stablelm.py | 1 - .../models/stablelm/modeling_stablelm.py | 123 +++- .../models/starcoder2/modeling_starcoder2.py | 128 +++- src/transformers/models/t5/modeling_t5.py | 2 +- src/transformers/models/t5/tokenization_t5.py | 3 +- .../modeling_table_transformer.py | 2 +- .../models/tapas/modeling_tapas.py | 3 +- .../models/tapas/modeling_tf_tapas.py | 2 +- .../models/tapas/tokenization_tapas.py | 6 +- .../models/udop/configuration_udop.py | 2 +- .../models/udop/tokenization_udop.py | 5 +- .../models/udop/tokenization_udop_fast.py | 2 +- .../models/univnet/modeling_univnet.py | 2 +- .../image_processing_video_llava.py | 7 +- .../video_llava/modeling_video_llava.py | 7 +- .../models/vilt/image_processing_vilt.py | 2 +- .../models/vipllava/modeling_vipllava.py | 1 + .../models/vitdet/modeling_vitdet.py | 2 +- .../wav2vec2/feature_extraction_wav2vec2.py | 8 +- .../models/wav2vec2/modeling_flax_wav2vec2.py | 6 +- .../models/wav2vec2/modeling_tf_wav2vec2.py | 4 +- .../models/wav2vec2/modeling_wav2vec2.py | 
2 +- .../configuration_wav2vec2_conformer.py | 4 +- .../modeling_wav2vec2_conformer.py | 2 +- .../processing_wav2vec2_with_lm.py | 2 +- .../whisper/feature_extraction_whisper.py | 12 +- .../models/whisper/generation_whisper.py | 426 +++++++++----- .../models/whisper/modeling_flax_whisper.py | 8 +- .../models/whisper/modeling_tf_whisper.py | 6 +- .../models/whisper/modeling_whisper.py | 130 +++- .../models/whisper/tokenization_whisper.py | 41 +- .../whisper/tokenization_whisper_fast.py | 14 +- .../models/x_clip/modeling_x_clip.py | 10 +- .../xlm_roberta/modeling_xlm_roberta.py | 2 +- .../xlm_roberta_xl/modeling_xlm_roberta_xl.py | 2 +- src/transformers/models/xmod/modeling_xmod.py | 2 +- .../models/yolos/modeling_yolos.py | 7 +- .../models/zoedepth/modeling_zoedepth.py | 2 +- .../pipelines/audio_classification.py | 2 +- src/transformers/pipelines/audio_utils.py | 55 +- .../pipelines/automatic_speech_recognition.py | 2 +- src/transformers/pipelines/base.py | 11 +- .../pipelines/document_question_answering.py | 2 +- src/transformers/pipelines/fill_mask.py | 2 +- .../pipelines/image_classification.py | 4 +- src/transformers/pipelines/pt_utils.py | 6 +- .../pipelines/question_answering.py | 2 +- .../zero_shot_audio_classification.py | 2 +- src/transformers/processing_utils.py | 70 ++- src/transformers/pytorch_utils.py | 1 + src/transformers/quantizers/auto.py | 11 +- .../quantizers/quantizer_bnb_4bit.py | 6 + .../quantizers/quantizer_fbgemm_fp8.py | 205 +++++++ src/transformers/testing_utils.py | 13 +- src/transformers/tokenization_utils.py | 17 +- src/transformers/tokenization_utils_base.py | 248 +++++--- src/transformers/trainer.py | 18 +- src/transformers/trainer_callback.py | 16 +- src/transformers/trainer_pt_utils.py | 12 +- src/transformers/trainer_seq2seq.py | 2 +- src/transformers/training_args.py | 6 +- src/transformers/utils/__init__.py | 2 + src/transformers/utils/chat_template_utils.py | 2 +- src/transformers/utils/dummy_pt_objects.py | 33 +- src/transformers/utils/generic.py | 2 +- src/transformers/utils/import_utils.py | 30 +- src/transformers/utils/quantization_config.py | 62 +- tests/agents/test_agents.py | 4 +- tests/agents/test_python_interpreter.py | 165 +++++- tests/agents/test_tools_common.py | 2 +- tests/deepspeed/test_deepspeed.py | 25 + tests/generation/test_configuration_utils.py | 138 +++-- tests/generation/test_utils.py | 118 +++- ...xtraction_audio_spectrogram_transformer.py | 4 +- tests/models/auto/test_processor_auto.py | 169 +++--- tests/models/bart/test_modeling_bart.py | 12 + tests/models/bloom/test_tokenization_bloom.py | 1 + .../chameleon/test_modeling_chameleon.py | 33 +- .../clap/test_feature_extraction_clap.py | 4 +- tests/models/clap/test_modeling_clap.py | 16 +- tests/models/clip/test_modeling_clip.py | 345 ++++++++++- .../clvp/test_feature_extraction_clvp.py | 4 +- tests/models/clvp/test_modeling_clvp.py | 8 +- tests/models/ctrl/test_modeling_tf_ctrl.py | 2 +- .../data2vec/test_modeling_data2vec_audio.py | 4 +- tests/models/dbrx/test_modeling_dbrx.py | 4 + tests/models/deberta/test_modeling_deberta.py | 2 +- .../deberta_v2/test_modeling_deberta_v2.py | 2 +- .../distilbert/test_modeling_distilbert.py | 2 +- .../test_feature_extraction_encodec.py | 4 +- tests/models/encodec/test_modeling_encodec.py | 12 +- .../models/flaubert/test_modeling_flaubert.py | 2 +- tests/models/gemma/test_modeling_gemma.py | 2 +- tests/models/gemma/test_tokenization_gemma.py | 35 ++ tests/models/gemma2/test_modeling_gemma2.py | 28 + tests/models/gpt2/test_modeling_gpt2.py | 34 ++ 
tests/models/gpt2/test_tokenization_gpt2.py | 1 + .../models/gpt_neox/test_modeling_gpt_neox.py | 34 ++ .../gpt_sw3/test_tokenization_gpt_sw3.py | 9 + tests/models/hubert/test_modeling_hubert.py | 4 +- .../models/hubert/test_modeling_tf_hubert.py | 4 +- tests/models/ibert/test_modeling_ibert.py | 4 +- .../models/idefics2/test_modeling_idefics2.py | 48 +- .../test_tokenization_layoutlmv2.py | 4 + .../test_tokenization_layoutlmv3.py | 4 + .../layoutxlm/test_tokenization_layoutxlm.py | 4 + tests/models/llama/test_modeling_llama.py | 171 +++++- tests/models/llava/test_modeling_llava.py | 13 + tests/models/llava/test_processor_llava.py | 17 + .../llava_next/test_modeling_llava_next.py | 38 +- .../llava_next/test_processor_llava_next.py | 41 ++ .../test_modeling_llava_next_video.py | 40 +- tests/models/luke/test_tokenization_luke.py | 6 +- .../models/lxmert/test_modeling_tf_lxmert.py | 2 +- tests/models/mamba/test_modeling_mamba.py | 37 +- tests/models/marian/test_modeling_marian.py | 2 +- .../markuplm/test_tokenization_markuplm.py | 4 + tests/models/mbart/test_modeling_mbart.py | 12 + tests/models/mluke/test_tokenization_mluke.py | 6 +- .../mobilebert/test_modeling_tf_mobilebert.py | 2 +- tests/models/phi3/test_modeling_phi3.py | 104 +++- .../test_feature_extraction_pop2piano.py | 4 +- .../pop2piano/test_processor_pop2piano.py | 4 +- .../roformer/test_tokenization_roformer.py | 2 +- .../test_feature_extraction_seamless_m4t.py | 61 +- tests/models/sew/test_modeling_sew.py | 4 +- tests/models/sew_d/test_modeling_sew_d.py | 4 +- .../test_feature_extraction_speech_to_text.py | 4 +- .../test_modeling_speech_to_text.py | 4 +- .../test_modeling_tf_speech_to_text.py | 4 +- .../test_feature_extraction_speecht5.py | 4 +- .../models/speecht5/test_modeling_speecht5.py | 8 +- .../squeezebert/test_modeling_squeezebert.py | 2 +- tests/models/tapas/test_tokenization_tapas.py | 4 + tests/models/udop/test_tokenization_udop.py | 4 + .../unispeech/test_modeling_unispeech.py | 4 +- .../test_modeling_unispeech_sat.py | 4 +- .../test_feature_extraction_univnet.py | 4 +- tests/models/univnet/test_modeling_univnet.py | 4 +- .../test_image_processing_video_llava.py | 43 +- .../vipllava/test_processor_vipllava.py | 41 ++ .../wav2vec2/test_modeling_flax_wav2vec2.py | 4 +- .../wav2vec2/test_modeling_tf_wav2vec2.py | 4 +- .../models/wav2vec2/test_modeling_wav2vec2.py | 4 +- .../test_modeling_wav2vec2_bert.py | 4 +- .../test_modeling_wav2vec2_conformer.py | 4 +- tests/models/wavlm/test_modeling_wavlm.py | 4 +- .../test_feature_extraction_whisper.py | 4 +- .../whisper/test_modeling_flax_whisper.py | 4 +- .../whisper/test_modeling_tf_whisper.py | 2 +- tests/models/whisper/test_modeling_whisper.py | 454 +++++++++++++- .../whisper/test_tokenization_whisper.py | 36 ++ .../test_pipelines_audio_classification.py | 4 +- ..._pipelines_automatic_speech_recognition.py | 180 +++--- tests/pipelines/test_pipelines_common.py | 4 +- .../test_pipelines_conversational.py | 439 -------------- .../test_pipelines_feature_extraction.py | 2 +- tests/quantization/fbgemm_fp8/__init__.py | 0 .../fbgemm_fp8/test_fbgemm_fp8.py | 270 +++++++++ tests/quantization/ggml/test_ggml.py | 25 +- tests/test_configuration_common.py | 2 +- tests/test_modeling_common.py | 153 ++++- tests/test_pipeline_mixin.py | 2 +- tests/test_tokenization_common.py | 183 ++++-- tests/tokenization/test_tokenization_utils.py | 12 + tests/trainer/test_data_collator.py | 19 + tests/trainer/test_trainer.py | 59 +- tests/utils/test_audio_utils.py | 4 +- tests/utils/test_cache_utils.py 
| 144 ++++- tests/utils/test_configuration_utils.py | 154 ++--- tests/utils/test_feature_extraction_utils.py | 144 ++--- tests/utils/test_image_processing_utils.py | 147 ++--- tests/utils/test_image_utils.py | 6 +- tests/utils/test_modeling_flax_utils.py | 164 +++--- tests/utils/test_modeling_rope_utils.py | 439 ++++++++++++++ tests/utils/test_modeling_tf_utils.py | 231 ++++---- tests/utils/test_modeling_utils.py | 342 +++++++---- tests/utils/test_tokenization_utils.py | 197 ++++--- utils/check_docstrings.py | 270 ++------- utils/notification_service.py | 41 +- utils/update_tiny_models.py | 2 +- 534 files changed, 12934 insertions(+), 5436 deletions(-) create mode 100644 docs/source/en/kv_cache.md create mode 100644 docs/source/en/quantization/fbgemm_fp8.md create mode 100644 docs/source/en/tasks/image_text_to_text.md delete mode 100644 docs/source/ko/custom_tools.md create mode 100644 docs/source/zh/philosophy.md create mode 100644 src/transformers/agents/monitoring.py create mode 100644 src/transformers/integrations/fbgemm_fp8.py create mode 100644 src/transformers/modeling_rope_utils.py create mode 100644 src/transformers/quantizers/quantizer_fbgemm_fp8.py create mode 100644 tests/models/llava_next/test_processor_llava_next.py create mode 100644 tests/models/vipllava/test_processor_vipllava.py delete mode 100644 tests/pipelines/test_pipelines_conversational.py create mode 100644 tests/quantization/fbgemm_fp8/__init__.py create mode 100644 tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py create mode 100644 tests/utils/test_modeling_rope_utils.py diff --git a/.circleci/config.yml b/.circleci/config.yml index cdd97f4fcecaff..6558dc1454b273 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -142,6 +142,7 @@ jobs: - run: python utils/custom_init_isort.py --check_only - run: python utils/sort_auto_mappings.py --check_only - run: python utils/check_doc_toc.py + - run: python utils/check_docstrings.py --check_all check_repository_consistency: working_directory: ~/transformers @@ -190,4 +191,4 @@ workflows: - check_circleci_user - check_code_quality - check_repository_consistency - - fetch_all_tests \ No newline at end of file + - fetch_all_tests diff --git a/.github/workflows/self-pr-slow-ci.yml b/.github/workflows/self-pr-slow-ci.yml index 8225e5b6aa7b1d..2287b5e3f31587 100644 --- a/.github/workflows/self-pr-slow-ci.yml +++ b/.github/workflows/self-pr-slow-ci.yml @@ -4,7 +4,7 @@ on: pull_request: paths: - "src/transformers/models/*/modeling_*.py" - - "tests/models/*/test_*.py" + - "tests/**/test_*.py" concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} diff --git a/.github/workflows/trufflehog.yml b/.github/workflows/trufflehog.yml index 7dde5462240424..29a11e9354dbb1 100644 --- a/.github/workflows/trufflehog.yml +++ b/.github/workflows/trufflehog.yml @@ -10,20 +10,9 @@ jobs: trufflehog: runs-on: ubuntu-latest steps: - - shell: bash - run: | - if [ "${{ github.event_name }}" == "push" ]; then - echo "depth=$(($(jq length <<< '${{ toJson(github.event.commits) }}') + 2))" >> $GITHUB_ENV - echo "branch=${{ github.ref_name }}" >> $GITHUB_ENV - fi - if [ "${{ github.event_name }}" == "pull_request" ]; then - echo "depth=$((${{ github.event.pull_request.commits }}+2))" >> $GITHUB_ENV - echo "branch=${{ github.event.pull_request.head.ref }}" >> $GITHUB_ENV - fi - - name: Checkout code - uses: actions/checkout@v4 - with: - ref: ${{env.branch}} - fetch-depth: ${{env.depth}} - - name: Secret Scanning - uses: trufflesecurity/trufflehog@main + - name: Checkout 
code + uses: actions/checkout@v4 + with: + fetch-depth: 0 + - name: Secret Scanning + uses: trufflesecurity/trufflehog@main diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f96bcd9e9d2875..4d62a44ab250d5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -132,7 +132,7 @@ You will need basic `git` proficiency to contribute to manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro Git](https://git-scm.com/book/en/v2) is a very good reference. -You'll need **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** or above to contribute to 🤗 Transformers. Follow the steps below to start contributing: +You'll need **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L449)** or above to contribute to 🤗 Transformers. Follow the steps below to start contributing: 1. Fork the [repository](https://github.com/huggingface/transformers) by clicking on the **[Fork](https://github.com/huggingface/transformers/fork)** button on the repository's page. This creates a copy of the code @@ -341,12 +341,12 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_ne RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification ``` -Like the slow tests, there are other environment variables available which not enabled by default during testing: +Like the slow tests, there are other environment variables available which are not enabled by default during testing: - `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers. - `RUN_PT_FLAX_CROSS_TESTS`: Enables tests for PyTorch + Flax integration. - `RUN_PT_TF_CROSS_TESTS`: Enables tests for TensorFlow + PyTorch integration. -More environment variables and additional information can be found in the [testing_utils.py](src/transformers/testing_utils.py). +More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py). 🤗 Transformers uses `pytest` as a test runner only. It doesn't use any `pytest`-specific features in the test suite itself. 
diff --git a/Makefile b/Makefile index f9b2a8c9a7c620..cfa40b7bd6ee6e 100644 --- a/Makefile +++ b/Makefile @@ -56,6 +56,7 @@ quality: python utils/custom_init_isort.py --check_only python utils/sort_auto_mappings.py --check_only python utils/check_doc_toc.py + python utils/check_docstrings.py --check_all # Format source code automatically and check is there are any problems left that need manual fixing diff --git a/docker/consistency.dockerfile b/docker/consistency.dockerfile index c59e48bdd89d51..73436f8afca29f 100644 --- a/docker/consistency.dockerfile +++ b/docker/consistency.dockerfile @@ -8,7 +8,7 @@ RUN pip install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools RUN uv pip install --no-cache-dir --upgrade 'torch' --index-url https://download.pytorch.org/whl/cpu # tensorflow pin matching setup.py RUN uv pip install --no-cache-dir "tensorflow-cpu<2.16" "tf-keras<2.16" -RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,quality,vision,testing]" +RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,quality,speech,vision,testing]" RUN git lfs install RUN pip uninstall -y transformers diff --git a/docker/transformers-all-latest-gpu/Dockerfile b/docker/transformers-all-latest-gpu/Dockerfile index 378a65d1bf37b8..9c5e3c91415745 100644 --- a/docker/transformers-all-latest-gpu/Dockerfile +++ b/docker/transformers-all-latest-gpu/Dockerfile @@ -9,7 +9,7 @@ SHELL ["sh", "-lc"] # The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant # to be used as arguments for docker build (so far). -ARG PYTORCH='2.3.0' +ARG PYTORCH='2.4.0' # (not always a valid torch version) ARG INTEL_TORCH_EXT='2.3.0' # Example: `cu102`, `cu113`, etc. diff --git a/docker/transformers-pytorch-gpu/Dockerfile b/docker/transformers-pytorch-gpu/Dockerfile index c9f77a78ce9b83..2c1f153eef275e 100644 --- a/docker/transformers-pytorch-gpu/Dockerfile +++ b/docker/transformers-pytorch-gpu/Dockerfile @@ -11,7 +11,7 @@ ARG REF=main RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF # If set to nothing, will install the latest version -ARG PYTORCH='2.3.0' +ARG PYTORCH='2.4.0' ARG TORCH_VISION='' ARG TORCH_AUDIO='' # Example: `cu102`, `cu113`, etc. 
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index cc6ff752c7701e..93f2c96d2d9df0 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -92,11 +92,15 @@ title: Visual Question Answering - local: tasks/text-to-speech title: Text to speech + - local: tasks/image_text_to_text + title: Image-text-to-text title: Multimodal - isExpanded: false sections: - local: generation_strategies title: Customize the generation strategy + - local: kv_cache + title: Best Practices for Generation with Cache title: Generation - isExpanded: false sections: @@ -155,6 +159,8 @@ title: EETQ - local: quantization/hqq title: HQQ + - local: quantization/fbgemm_fp8 + title: FBGEMM_FP8 - local: quantization/optimum title: Optimum - local: quantization/contribute @@ -326,8 +332,6 @@ title: CamemBERT - local: model_doc/canine title: CANINE - - local: model_doc/chameleon - title: chameleon - local: model_doc/codegen title: CodeGen - local: model_doc/code_llama @@ -760,6 +764,8 @@ title: BridgeTower - local: model_doc/bros title: BROS + - local: model_doc/chameleon + title: Chameleon - local: model_doc/chinese_clip title: Chinese-CLIP - local: model_doc/clip diff --git a/docs/source/en/agents.md b/docs/source/en/agents.md index d1c550f5d32ea8..f335cb678faa60 100644 --- a/docs/source/en/agents.md +++ b/docs/source/en/agents.md @@ -509,3 +509,54 @@ agent = ReactCodeAgent(tools=[search_tool]) agent.run("How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need?") ``` + +## Gradio interface + +You can leverage `gradio.Chatbot` to display your agent's thoughts using `stream_to_gradio`. Here is an example: + +```py +import gradio as gr +from transformers import ( + load_tool, + ReactCodeAgent, + HfEngine, + stream_to_gradio, +) + +# Import tool from Hub +image_generation_tool = load_tool("m-ric/text-to-image") + +llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct") + +# Initialize the agent with the image generation tool +agent = ReactCodeAgent(tools=[image_generation_tool], llm_engine=llm_engine) + + +def interact_with_agent(task): + messages = [] + messages.append(gr.ChatMessage(role="user", content=task)) + yield messages + for msg in stream_to_gradio(agent, task): + messages.append(msg) + yield messages + [ + gr.ChatMessage(role="assistant", content="⏳ Task not finished yet!") + ] + yield messages + + +with gr.Blocks() as demo: + text_input = gr.Textbox(lines=1, label="Chat Message", value="Make me a picture of the Statue of Liberty.") + submit = gr.Button("Run illustrator agent!") + chatbot = gr.Chatbot( + label="Agent", + type="messages", + avatar_images=( + None, + "https://em-content.zobj.net/source/twitter/53/robot-face_1f916.png", + ), + ) + submit.click(interact_with_agent, [text_input], [chatbot]) + +if __name__ == "__main__": + demo.launch() +``` \ No newline at end of file diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md index d840caaf660520..c4069dd1afc706 100644 --- a/docs/source/en/chat_templating.md +++ b/docs/source/en/chat_templating.md @@ -580,7 +580,7 @@ default template for that model class is used instead.
Let's take a look at the >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") ->>> tokenizer.default_chat_template +>>> tokenizer.chat_template "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}" ``` @@ -704,23 +704,6 @@ with other names, pass the name of the template you want to the `chat_template` We find that this can be a bit confusing for users, though - so if you're writing a template yourself, we recommend trying to put it all in a single template where possible! -### What are "default" templates? - -Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards -compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a -model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline` -class and methods like `apply_chat_template` will use the class template instead. You can find out what the default -template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute. - -This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when -the class template is appropriate for your model, we strongly recommend overriding the default template by -setting the `chat_template` attribute explicitly to make it clear to users that your model has been correctly configured -for chat. - -Now that actual chat templates have been adopted more widely, default templates have been deprecated and will be -removed in a future release. We strongly recommend setting the `chat_template` attribute for any tokenizers that -still depend on them! - ### What template should I use? When setting the template for a model that's already been trained for chat, you should ensure that the template diff --git a/docs/source/en/conversations.md b/docs/source/en/conversations.md index 9336503ad7cb8c..a48c046b4949d7 100644 --- a/docs/source/en/conversations.md +++ b/docs/source/en/conversations.md @@ -195,7 +195,7 @@ inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()} print("Tokenized inputs:\n", inputs) # 4: Generate text from the model -outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.) +outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1) print("Generated tokens:\n", outputs) # 5: Decode the output back to a string diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md index 68430de643f17b..3a9392ddd07d9b 100644 --- a/docs/source/en/generation_strategies.md +++ b/docs/source/en/generation_strategies.md @@ -174,43 +174,6 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te ``` -## KV Cache Quantization - -The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value -cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models. -Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed. 
- -KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache] -(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper. - -To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`. -Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class. -One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`. - - - -Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization. - - - - -```python ->>> import torch ->>> from transformers import AutoTokenizer, AutoModelForCausalLM - ->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") ->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0") ->>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device) - ->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"}) ->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) -I like rock music because it's loud and energetic. It's a great way to express myself and rel - ->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20) ->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) -I like rock music because it's loud and energetic. I like to listen to it when I'm feeling -``` - ## Watermarking The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green". diff --git a/docs/source/en/internal/generation_utils.md b/docs/source/en/internal/generation_utils.md index da7ea25e54b6b0..1172e32fd0cc5a 100644 --- a/docs/source/en/internal/generation_utils.md +++ b/docs/source/en/internal/generation_utils.md @@ -386,11 +386,24 @@ A [`Constraint`] can be used to force the generation to include specific tokens - get_seq_length - reorder_cache +[[autodoc]] OffloadedCache + - update + - prefetch_layer + - evict_previous_layer + [[autodoc]] StaticCache - update - get_seq_length - reset +[[autodoc]] HybridCache + - update + - reset + +[[autodoc]] SlidingWindowCache + - update + - reset + [[autodoc]] EncoderDecoderCache - get_seq_length - to_legacy_cache @@ -398,6 +411,11 @@ A [`Constraint`] can be used to force the generation to include specific tokens - reset - reorder_cache +[[autodoc]] MambaCache + - update_conv_state + - update_ssm_state + - reset + ## Watermark Utils [[autodoc]] WatermarkDetector diff --git a/docs/source/en/kv_cache.md b/docs/source/en/kv_cache.md new file mode 100644 index 00000000000000..c0ccc49d41e683 --- /dev/null +++ b/docs/source/en/kv_cache.md @@ -0,0 +1,346 @@ + + +# Best Practices for Generation with Cache + +Efficient caching is crucial for optimizing the performance of models in various generative tasks, +including text generation, translation, summarization and other transformer-based applications. +Effective caching helps reduce computation time and improve response rates, especially in real-time or resource-intensive applications. 
+ +Transformers support various caching methods, leveraging "Cache" classes to abstract and manage the caching logic. +This document outlines best practices for using these classes to maximize performance and efficiency. +Check out all the available `Cache` classes in the [API documentation](./internal/generation_utils.md). + +## What is Cache and why we should care? + +Imagine you’re having a conversation with someone, and instead of remembering what was said previously, you have to start from scratch every time you respond. This would be slow and inefficient, right? In the world of Transformer models, a similar concept applies, and that's where Caching keys and values come into play. From now on, I'll refer to the concept as KV Cache. + +KV cache is needed to optimize the generation in autoregressive models, where the model predicts text token by token. This process can be slow since the model can generate only one token at a time, and each new prediction is dependent on the previous context. That means, to predict token number 1000 in the generation, you need information from the previous 999 tokens, which comes in the form of some matrix multiplications across the representations of those tokens. But to predict token number 1001, you also need the same information from the first 999 tokens, plus additional information from token number 1000. That is where key-value cache is used to optimize the sequential generation process by storing previous calculations to reuse in subsequent tokens, so they don't need to be computed again. + +More concretely, key-value cache acts as a memory bank for these generative models, where the model stores key-value pairs derived from self-attention layers for previously processed tokens. By storing this information, the model can avoid redundant computations and instead retrieve keys and values of previous tokens from the cache. + +
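+Below is a minimal, self-contained sketch of this idea for a single attention layer. It is purely illustrative and not part of the Transformers API: the tensor names and toy shapes are assumptions chosen for the example.
+
+```python
+import torch
+
+# Toy shapes: (batch=1, heads=2, seq_len, head_dim=4); 5 tokens are already cached
+past_keys = torch.randn(1, 2, 5, 4)
+past_values = torch.randn(1, 2, 5, 4)
+
+# Projections for the single new token (the only thing computed at this step)
+new_key = torch.randn(1, 2, 1, 4)
+new_value = torch.randn(1, 2, 1, 4)
+query = torch.randn(1, 2, 1, 4)
+
+# The "cache update": concatenate along the sequence-length dimension
+keys = torch.cat([past_keys, new_key], dim=-2)        # (1, 2, 6, 4)
+values = torch.cat([past_values, new_value], dim=-2)  # (1, 2, 6, 4)
+
+# The new token attends over all 6 cached positions
+attn_weights = (query @ keys.transpose(-1, -2) / keys.shape[-1] ** 0.5).softmax(dim=-1)  # (1, 2, 1, 6)
+attn_output = attn_weights @ values                                                      # (1, 2, 1, 4)
+```
+
+Note how the attention weights for the new token have shape `(new_tokens_length, past_kv_length + new_tokens_length)` per head, here `(1, 6)`, which is exactly the shape discussed in the deep dive below.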
+ For the Curious Minds Who Like to Dive Deep + + ### Under the Hood: How Cache Object Works in Attention Mechanism + + When utilizing a cache object in the input, the Attention module performs several critical steps to integrate past and present information seamlessly. + + The Attention module concatenates the current key-values with the past key-values stored in the cache. This results in attention weights of shape `(new_tokens_length, past_kv_length + new_tokens_length)`. Essentially, the past and current key-values are combined to compute attention scores, ensuring that the model considers both previous context and new input. + + Therefore, when iteratively calling `forward()` instead of the `generate()` method, it’s crucial to ensure that the attention mask shape matches the combined length of past and current key-values. The attention mask should have the shape `(batch_size, past_kv_length + new_tokens_length)`. This is usually handled internally when you call the `generate()` method. If you want to implement your own generation loop with Cache classes, take this into consideration and prepare the attention mask to hold values for both current and past tokens. + + + + One important concept you need to know when writing your own generation loop is `cache_position`. In case you want to reuse an already filled Cache object by calling `forward()`, you have to pass in a valid `cache_position` which will indicate the positions of inputs in the sequence. Note that `cache_position` is not affected by padding, and always adds one more position for each token. For example, if the key/value cache contains 10 tokens (no matter how many of them are pad tokens), the cache position for the next token should be `torch.tensor([10])`. A short sketch of this is shown after the example below. + + + + + See an example below for how to implement your own generation loop. + + ```python + >>> import torch + >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache + + >>> model_id = "meta-llama/Llama-2-7b-chat-hf" + >>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0") + >>> tokenizer = AutoTokenizer.from_pretrained(model_id) + + >>> past_key_values = DynamicCache() + >>> messages = [{"role": "user", "content": "Hello, what's your name."}] + >>> inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0") + + >>> generated_ids = inputs.input_ids + >>> cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.int64, device="cuda:0") + >>> max_new_tokens = 10 + + >>> for _ in range(max_new_tokens): + ... outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True) + ... # Greedily select the next token + ... next_token_ids = outputs.logits[:, -1:].argmax(-1) + ... generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1) + ... + ... # Prepare inputs for the next generation step by leaving unprocessed tokens, in our case we have only one new token + ... # and expanding attn mask for the new token, as explained above + ... attention_mask = inputs["attention_mask"] + ... attention_mask = torch.cat([attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1) + ... inputs = {"input_ids": next_token_ids, "attention_mask": attention_mask} + ...
cache_position = cache_position[-1:] + 1 # add one more position for the next token + + >>> print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]) + "[INST] Hello, what's your name. [/INST] Hello! My name is LLaMA," + ``` + +
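+As a quick sanity check of this bookkeeping, you can inspect a [`~DynamicCache`] directly. The snippet below is a minimal, illustrative sketch (it reuses the same Llama 2 chat checkpoint as the example above, with an arbitrary prompt): after the prefill forward pass the cache holds one position per prompt token, and after decoding one more token it grows by exactly one, which is also the length the attention mask has to cover on the next step.
+
+```python
+>>> import torch
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
+
+>>> model_id = "meta-llama/Llama-2-7b-chat-hf"
+>>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
+>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+>>> inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
+>>> past_key_values = DynamicCache()
+
+>>> # Prefill: the cache stores one key/value entry per prompt token
+>>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
+>>> past_key_values.get_seq_length() == inputs.input_ids.shape[1]
+True
+
+>>> # Decode one token: the attention mask must cover past + new tokens, and the cache grows by exactly one
+>>> next_token_ids = outputs.logits[:, -1:].argmax(-1)
+>>> attention_mask = torch.cat([inputs.attention_mask, inputs.attention_mask.new_ones((1, 1))], dim=-1)
+>>> cache_position = torch.tensor([past_key_values.get_seq_length()], device="cuda:0")
+>>> outputs = model(input_ids=next_token_ids, attention_mask=attention_mask, past_key_values=past_key_values, use_cache=True, cache_position=cache_position)
+>>> past_key_values.get_seq_length() == inputs.input_ids.shape[1] + 1
+True
+```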
+ + + +## Generate with Cache + +In 🤗 Transformers, we support various Cache types to optimize the performance across different models and tasks. By default, all models generate with caching, +with the [`~DynamicCache`] class being the default cache for most models. It allows us to dynamically grow the cache size by saving more and more keys and values as we generate. If for some reason you don't want to use caches, you can pass `use_cache=False` into the `generate()` method. + +Refer to the table below to see the difference between cache types and choose the one that best suits your use case. + +| Cache Type | Memory Efficient | Supports torch.compile() | Initialization Recommended | Latency | Long Context Generation | +|---------------------|------------------|--------------------------|----------------------------|----------|--------------------------| +| Dynamic Cache | No | No | No | Mid | No | +| Static Cache | No | Yes | Yes | High | No | +| Quantized Cache | Yes | No | No | Low | Yes | +| Offloaded Cache | Yes | No | No | Low | No | +| Sliding Window Cache| No | Yes | Yes | High | No | +| Sink Cache | Yes | No | Yes | Mid | Yes | + + +These cache classes can be set with a `cache_implementation` argument when generating. To learn about the available options for the `cache_implementation` flag, please refer to the [API Documentation](./main_classes/text_generation.md#transformers.GenerationConfig). Now, let's explore each cache type in detail and see how to use them. Note that the below examples are for decoder-only Transformer-based models. We also support "Model-Specific Cache" classes for models such as Mamba or Jamba; keep reading for more details. + +### Quantized Cache + +The key and value cache can occupy a large portion of memory, becoming a [bottleneck for long-context generation](https://huggingface.co/blog/llama31#inference-memory-requirements), especially for Large Language Models. +Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed. + +KV Cache quantization in `transformers` is largely inspired by the paper ["KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache"](https://arxiv.org/abs/2402.02750) and currently supports the [`~QuantoQuantizedCache`] and [`~HQQQuantizedCache`] classes. For more information on the inner workings, see the paper. + +To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`. +Quantization-related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`~QuantizedCacheConfig`] class. +One has to indicate which quantization backend to use in the [`~QuantizedCacheConfig`]; the default is `quanto`. + + + +Cache quantization can be detrimental in terms of latency if the context length is short and there is enough GPU VRAM available to run without cache quantization. It is recommended to seek a balance between memory efficiency and latency.
+ + + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM + +>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") +>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0") +>>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device) + +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"}) +>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) +I like rock music because it's loud and energetic. It's a great way to express myself and rel + +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20) +>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) +I like rock music because it's loud and energetic. I like to listen to it when I'm feeling +``` + +### Offloaded Cache + +Similarly to KV cache quantization, the [`~OffloadedCache`] strategy aims to reduce GPU VRAM usage. +It does so by moving the KV cache for most layers to the CPU. +As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU. +At the same time, it asynchronously prefetches the next layer's cache and sends the previous layer's cache back to the CPU. +Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation. +Thus, it can serve as a drop-in replacement or a fallback for it. + +Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.), +you may notice a small degradation in generation throughput compared to the default KV cache implementation. + +To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config` or directly to the `generate()` call. + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM +>>> ckpt = "microsoft/Phi-3-mini-4k-instruct" + +>>> tokenizer = AutoTokenizer.from_pretrained(ckpt) +>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0") +>>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device) + +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded") +>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) +Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896. + +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23) +>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0]) +Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896. +``` + + + +Cache offloading requires a GPU and can be slower than the dynamic KV cache. Use it if you are getting CUDA out of memory errors. + + + +The example below shows how KV cache offloading can be used as a fallback strategy. +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM +>>> def resilient_generate(model, *args, **kwargs): +... oom = False +... try: +... return model.generate(*args, **kwargs) +... except torch.cuda.OutOfMemoryError as e: +... print(e) +... print("retrying with cache_implementation='offloaded'") +... oom = True +... if oom: +... torch.cuda.empty_cache() +... kwargs["cache_implementation"] = "offloaded" +...
return model.generate(*args, **kwargs) +... +... +>>> ckpt = "microsoft/Phi-3-mini-4k-instruct" +>>> tokenizer = AutoTokenizer.from_pretrained(ckpt) +>>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0") +>>> prompt = ["okay "*1000 + "Fun fact: The most"] +>>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device) +>>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, } +>>> out = resilient_generate(model, **inputs, **beams) +>>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True) +``` + +On a GPU with 50 GB of RAM, running this code will print +``` +CUDA out of memory. Tried to allocate 4.83 GiB. GPU +retrying with cache_implementation='offloaded' +``` +before successfully generating 40 beams. + + + +### Static Cache + +Since the [`~DynamicCache`] dynamically grows with each generation step, it prevents you from taking advantage of JIT optimizations. The [`~StaticCache`] pre-allocates +a specific maximum size for the keys and values, allowing you to generate up to the maximum length without having to modify the cache size. Check the usage example below. + +For more examples with Static Cache and JIT compilation, take a look at [StaticCache & torchcompile](./llm_optims.md#static-kv-cache-and-torchcompile). + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM + +>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") +>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto") +>>> inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device) + +>>> # simply pass cache_implementation="static" +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static") +>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] +"Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of" +``` + +### Sliding Window Cache + +As the name suggests, this cache type implements a sliding window over previous keys and values, retaining only the last `sliding_window` tokens. It should be used with models like Mistral that support sliding window attention. Additionally, similar to Static Cache, this one is JIT-friendly and can be used with the same compile techniques as Static Cache. + +Note that you can use this cache only for models that support sliding window, e.g. Mistral models. + + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16).to("cuda:0") +>>> inputs = tokenizer("Yesterday I was on a rock concert and.", return_tensors="pt").to(model.device) + +>>> # can be used by passing in the cache implementation +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, cache_implementation="sliding_window") +>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] +"Yesterday I was on a rock concert and. I was so excited to see my favorite band. I was so excited that I was jumping up and down and screaming. I was so excited that I" +``` + +### Sink Cache + +Sink Cache was introduced in ["Efficient Streaming Language Models with Attention Sinks"](https://arxiv.org/abs/2309.17453).
It allows you to generate long sequences of text ("infinite length" according to the paper) without any fine-tuning. That is achieved by smart handling of previous keys and values: specifically, it retains a few initial tokens from the sequence, called "sink tokens". This is based on the observation that these initial tokens attract a significant portion of the attention scores during the generation process. Tokens that come after the "sink tokens" are discarded on a sliding window basis, keeping only the latest `window_size` tokens. By keeping these initial tokens as "attention sinks," the model maintains stable performance even when dealing with very long texts, even though most of the earlier context is discarded. + +Unlike other cache classes, this one can't be used directly by indicating a `cache_implementation`. You have to initialize the Cache before calling `generate()`, as follows. + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM, SinkCache + +>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") +>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0") +>>> inputs = tokenizer("This is a long story about unicorns, fairies and magic.", return_tensors="pt").to(model.device) + +>>> # get our cache, specify number of sink tokens and window size +>>> # Note that the window size already includes the sink tokens, so it has to be larger than `num_sink_tokens` +>>> past_key_values = SinkCache(window_length=256, num_sink_tokens=4) +>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=30, past_key_values=past_key_values) +>>> tokenizer.batch_decode(out, skip_special_tokens=True)[0] +"This is a long story about unicorns, fairies and magic. It is a fantasy world where unicorns and fairies live together in harmony. The story follows a young girl named Lily" +``` + +### Encoder-Decoder Cache + +The [`~EncoderDecoderCache`] is a wrapper designed to handle the caching needs of encoder-decoder models. This cache type is specifically built to manage both self-attention and cross-attention caches, ensuring storage and retrieval of the past key/values required for these complex models. A nice feature of the Encoder-Decoder Cache is that you can set different cache types for the encoder and for the decoder, depending on your use case. Currently this cache is only supported in [Whisper](./model_doc/whisper.md) models, but we will be adding more models soon. + +In terms of usage, there is nothing special to be done and calling `generate()` or `forward()` will handle everything for you. + + +### Model-specific Cache Classes + +Some models require storing previous keys, values, or states in a specific way, and the above cache classes cannot be used. For such cases, we have several specialized cache classes that are designed for specific models. These models only accept their own dedicated cache classes and do not support using any other cache types. Some examples include [`~HybridCache`] for [Gemma2](./model_doc/gemma2.md) series models or [`~MambaCache`] for [Mamba](./model_doc/mamba.md) architecture models. + + +## Iterative Generation with Cache + +We have seen how to use each of the cache types when generating. What if you want to use the cache in an iterative generation setting, for example in applications like chatbots, where interactions involve multiple turns and continuous back-and-forth exchanges?
Iterative generation with cache allows these systems to handle ongoing conversations effectively without reprocessing the entire context at each step. But there are some tips that you should know before you start implementing: + +The general format when doing iterative generation is shown below. First you have to initialize an empty cache of the type you want, and then you can start feeding in new prompts iteratively. Keeping track of the dialogue history and formatting it can be done with chat templates; read more on that in [chat_templating](./chat_templating.md). + +In case you are using Sink Cache, you have to crop your inputs to the maximum cache length, because Sink Cache can generate text longer than its maximum window size, but it expects the first input to not exceed the maximum cache length. + + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModelForCausalLM +>>> from transformers.cache_utils import ( +... DynamicCache, +... SinkCache, +... StaticCache, +... SlidingWindowCache, +... QuantoQuantizedCache, +... QuantizedCacheConfig, +... ) + +>>> model_id = "meta-llama/Llama-2-7b-chat-hf" +>>> model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto') +>>> tokenizer = AutoTokenizer.from_pretrained(model_id) + +>>> user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."] + +>>> past_key_values = DynamicCache() +>>> max_cache_length = past_key_values.get_max_length() + +>>> messages = [] +>>> for prompt in user_prompts: +... messages.append({"role": "user", "content": prompt}) +... inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device) +... if isinstance(past_key_values, SinkCache): +... inputs = {k: v[:, -max_cache_length:] for k, v in inputs.items()} +... +... input_length = inputs["input_ids"].shape[1] +... +... outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, past_key_values=past_key_values) +... completion = tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True) +... messages.append({"role": "assistant", "content": completion}) + +>>> print(messages) +[{'role': 'user', 'content': "Hello, what's your name?"}, {'role': 'assistant', 'content': " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. 😊"}, {'role': 'user', 'content': 'Btw, yesterday I was on a rock concert.'}, {'role': 'assistant', 'content': ' Oh, cool! That sounds like a lot of fun! 🎉 Did you enjoy the concert? What was the band like? 🤔'}] +``` + + +## Re-use Cache to continue generation + +Sometimes you may want to first fill a cache object with key/values for a certain prefix prompt and re-use it several times to generate different sequences from it. We are working hard on adding this feature to 🤗 Transformers and will update this section soon. diff --git a/docs/source/en/llm_optims.md b/docs/source/en/llm_optims.md index 5e49f0e1ebd3ab..8e7e9c54d42a42 100644 --- a/docs/source/en/llm_optims.md +++ b/docs/source/en/llm_optims.md @@ -18,59 +18,109 @@ Basic inference is slow because LLMs have to be called repeatedly to generate th This guide will show you how to use the optimization techniques available in Transformers to accelerate LLM inference. > [!TIP] -> Hugging Face also provides [Text Generation Inference (TGI)](https://hf.co/docs/text-generation-inference), a library dedicated to deploying and serving highly optimized LLMs for inference.
It includes more optimization features not included in Transformers, such as continuous batching for increasing throughput and tensor parallelism for multi-GPU inference. +> Hugging Face also provides [Text Generation Inference (TGI)](https://hf.co/docs/text-generation-inference), a library dedicated to deploying and serving highly optimized LLMs for inference. It includes deployment-oriented optimization features not included in Transformers, such as continuous batching for increasing throughput and tensor parallelism for multi-GPU inference. -## Static kv-cache and torch.compile +## Static kv-cache and `torch.compile` During decoding, a LLM computes the key-value (kv) values for each input token and since it is autoregressive, it computes the same kv values each time because the generated output becomes part of the input now. This is not very efficient because you're recomputing the same kv values each time. -To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [torch.compile](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. +To optimize this, you can use a kv-cache to store the past keys and values instead of recomputing them each time. However, since the kv-cache grows with each generation step and is dynamic, it prevents you from taking advantage of [`torch.compile`](./perf_torch_compile), a powerful optimization tool that fuses PyTorch code into fast and optimized kernels. -The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with torch.compile for up to a 4x speed up. +The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with `torch.compile` for up to a 4x speed up. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware. > [!WARNING] -> Currently, only [Llama](./model_doc/llama2) and a few other models support static kv-cache and torch.compile. Check [this issue](https://github.com/huggingface/transformers/issues/28981) for a live model compatibility list. +> Currently, only [Llama](./model_doc/llama2) and a few other models support static kv-cache and `torch.compile`. Check [this issue](https://github.com/huggingface/transformers/issues/28981) for a live model compatibility list. -For this example, let's load the [Gemma](https://hf.co/google/gemma-2b) model. +There are three flavors of static kv-cache usage, depending on the complexity of your task: +1. Basic usage: simply set a flag in `generation_config` (recommended); +2. Advanced usage: handle a cache object for multi-turn generation or a custom generation loop; +3. Advanced usage: compile the entire `generate` function into a single graph, if having a single graph is relevant for you. + +Select the correct tab below for further instructions on each of these flavors. + +> [!TIP] +> Regardless of the strategy used with `torch.compile`, you can avoid shape-related recompilations if you left-pad your LLM inputs to a limited set of values. The [`pad_to_multiple_of` tokenizer flag](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.pad_to_multiple_of) is your friend! + + + + +For this example, let's use the [Gemma](https://hf.co/google/gemma-2b) model. 
All we need to do is to: +1. Access the model's `generation_config` attribute and set the `cache_implementation` to "static"; +2. Call `torch.compile` on the model to compile the forward pass with the static kv-cache. + +And that's it! ```py from transformers import AutoTokenizer, AutoModelForCausalLM +import torch +import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") -model = AutoModelForCausalLM.from_pretrained( - "google/gemma-2b", device_map="auto" -) +model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") + +model.generation_config.cache_implementation = "static" + +model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True) +input_text = "The theory of special relativity states " +input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") + +outputs = model.generate(**input_ids) +print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['The theory of special relativity states 1. The speed of light is constant in all inertial reference'] ``` -There are two ways you can configure the model to use a static kv-cache. For a 7B model on an A100, both methods get a 4x speed up in the forward pass. Your speed up may vary depending on the model size (larger models have a smaller speed up) and hardware. If you're using the [`~GenerationMixin.generate`] method, the speed up is ~3x. The forward pass (which still gets 4x speed up) is only a part of the whole [`~GenerationMixin.generate`] code. +Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. Avoiding re-compilation is critical to get the most out of `torch.compile`, and you should be aware of the following: +1. If the batch size changes or the maximum output length increases between calls, the cache will have to be reinitialized, triggering a new compilation; +2. The first couple of calls of the compiled function are slower, as the function is being compiled. - - +> [!WARNING] +> For a more advanced usage of the static cache, such as multi-turn conversations, we recommend instantiating and manipulating the cache object outside [`~GenerationMixin.generate`]. See the advanced usage tab. + + + -Access the model's `generation_config` attribute and set the `cache_implementation` to "static". +A [`StaticCache`] object can be passed to the model's [`~GenerationMixin.generate`] under the `past_key_values` argument. The object will retain the cache contents, so you can pass it to a new [`~GenerationMixin.generate`] call to continue generation, like you would do with a dynamic cache. ```py -model.generation_config.cache_implementation = "static" -``` +from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache +import torch +import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) -Call torch.compile on the model to compile the forward pass with the static kv-cache. 
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") +model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") -```py -compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True) +model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True) input_text = "The theory of special relativity states " input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") +prompt_length = input_ids.input_ids.shape[1] +model.generation_config.max_new_tokens = 16 + +past_key_values = StaticCache( + config=model.config, + max_batch_size=1, + # If you plan to reuse the cache, make sure the cache length is large enough for all cases + max_cache_len=prompt_length+(model.generation_config.max_new_tokens*2), + device=model.device, + dtype=model.dtype +) +outputs = model.generate(**input_ids, past_key_values=past_key_values) +print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2'] -outputs = compiled_model.generate(**input_ids) -tokenizer.batch_decode(outputs, skip_special_tokens=True) -['The theory of special relativity states 1. The speed of light is constant in all inertial reference'] +# pass in the generated text and the same cache object to continue generation from where it left off. Optionally, in a +# multi-turn conversation, append the new user input to the generated text. +new_input_ids = outputs +outputs = model.generate(new_input_ids, past_key_values=past_key_values) +print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['The theory of special relativity states 1. The speed of light is constant in all inertial reference frames. 2. The speed of light is constant in all inertial reference frames. 3.'] ``` -Under the hood, `generate` will attempt to reuse the same cache object, removing the need for re-compilation at each call. However, if the batch size or the maximum output length increase between calls, the cache will have to be reinitialized, triggering a new compilation. - - - +> [!TIP] +> If you want to reuse the same [`StaticCache`] object on a new prompt, be sure to reset its contents with the `.reset()` method between calls -A [`StaticCache`] object can be passed to the model's forward pass under the `past_key_values` argument, enabling the use of this object as a static kv-cache. Using this strategy, you can write your own function to decode the next token given the current token and position and cache position of previously generated tokens. You can also pass the [`StaticCache`] object to [`~GenerationMixin.generate`] and use it across calls, like you would do with a dynamic cache. +If you want to go further down a level, the [`StaticCache`] object can also be passed to the model's forward pass under the same `past_key_values` argument. Using this strategy, you can write your own function to decode the next token given the current token and position and cache position of previously generated tokens. ```py from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging @@ -102,12 +152,9 @@ def decode_one_tokens(model, cur_token, input_pos, cache_position, past_key_valu return new_token ``` -There are a few important things you must do to enable static kv-cache and torch.compile with the `StaticCache` method: - +There are a few important things you must do to enable static kv-cache and `torch.compile` with the `StaticCache` method: 1. 
Initialize the [`StaticCache`] instance before using the model for inference. There you can configure parameters like the maximum batch size and sequence length. + -2. Call torch.compile on the model to compile the forward pass with the static kv-cache. - +2. Call `torch.compile` on the model to compile the forward pass with the static kv-cache. 3. Set `enable_math=True` in the [torch.backends.cuda.sdp_kernel](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) context manager to enable the native PyTorch C++ implementation of scaled dot product attention to speed up inference even more. ```py @@ -142,8 +189,34 @@ text 'My favorite all time favorite condiment is ketchup. I love it on everything. I love it on my eggs, my fries, my chicken, my burgers, my hot dogs, my sandwiches, my salads, my p'] -> [!TIP] -> If you want to reuse the [`StaticCache`] object on a new prompt, be sure to reset its contents with the `.reset()` method + + + +Compiling the entire `generate` function, in terms of code, is even simpler than in the basic usage: call `torch.compile` on `generate` to compile the entire function. No need to specify the use of the static cache: although it is compatible, dynamic cache (default) was faster in our benchmarks. + +```py +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch +import os +os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :) + +tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") +model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto") + +model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True) +input_text = "The theory of special relativity states " +input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") + +outputs = model.generate(**input_ids) +print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['The theory of special relativity states 1. The speed of light is constant in all inertial reference'] +``` + +As a result, we compile not only the model forward pass, but also all input preparation, logit processor operations, and so on. The result should be a slightly faster `generate` call, compared to the basic usage example, and the compiled graph may be better suited to more exotic hardware devices or use cases. However, there are severe drawbacks in using this approach: +1. Compilation is much slower; +2. All parameterization of `generate` must be done through `generation_config` (see the sketch after this list); +3. Many warnings and exceptions are suppressed -- we suggest testing with its uncompiled form first; +4. Although we are working on it, it is heavily feature restricted (for instance, at the time of writing, generation does not stop if an EOS token is selected).
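+
+For instance, parameters you would normally pass to `generate` as keyword arguments (like `max_new_tokens`) should instead be set on `generation_config` before the compiled call. The snippet below is a small illustrative sketch of that pattern, reusing the Gemma checkpoint from the examples above; which fields you set depends on your use case.
+
+```py
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
+
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
+model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")
+
+# Illustrative: parameterize generation through `generation_config`, not through `generate` kwargs
+model.generation_config.max_new_tokens = 16
+model.generation_config.do_sample = False
+
+model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True)
+input_text = "The theory of special relativity states "
+input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+
+outputs = model.generate(**input_ids)
+print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+```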
diff --git a/docs/source/en/main_classes/agent.md b/docs/source/en/main_classes/agent.md index 8376fb36486c7c..444003615ba4f1 100644 --- a/docs/source/en/main_classes/agent.md +++ b/docs/source/en/main_classes/agent.md @@ -72,6 +72,10 @@ We provide two types of agents, based on the main [`Agent`] class: [[autodoc]] launch_gradio_demo +### stream_to_gradio + +[[autodoc]] stream_to_gradio + ### ToolCollection [[autodoc]] ToolCollection diff --git a/docs/source/en/main_classes/backbones.md b/docs/source/en/main_classes/backbones.md index efea7eb32a84c8..5f1fc1dcbe1f20 100644 --- a/docs/source/en/main_classes/backbones.md +++ b/docs/source/en/main_classes/backbones.md @@ -25,11 +25,11 @@ A backbone is a model used for feature extraction for higher level computer visi Backbones are supported for the following models: -* [BEiT](..model_doc/beit) +* [BEiT](../model_doc/beit) * [BiT](../model_doc/bit) -* [ConvNet](../model_doc/convnext) +* [ConvNext](../model_doc/convnext) * [ConvNextV2](../model_doc/convnextv2) -* [DiNAT](..model_doc/dinat) +* [DiNAT](../model_doc/dinat) * [DINOV2](../model_doc/dinov2) * [FocalNet](../model_doc/focalnet) * [MaskFormer](../model_doc/maskformer) diff --git a/docs/source/en/main_classes/data_collator.md b/docs/source/en/main_classes/data_collator.md index 74e653dd1185e9..e704bb747fe6e0 100644 --- a/docs/source/en/main_classes/data_collator.md +++ b/docs/source/en/main_classes/data_collator.md @@ -66,3 +66,8 @@ Examples of use can be found in the [example scripts](../examples) or [example n - numpy_mask_tokens - tf_mask_tokens - torch_mask_tokens + +## DataCollatorWithFlattening + +[[autodoc]] data.data_collator.DataCollatorWithFlattening + diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md index f1e2acdcfe4809..fc5808415cbe5f 100755 --- a/docs/source/en/main_classes/quantization.md +++ b/docs/source/en/main_classes/quantization.md @@ -56,3 +56,8 @@ Learn how to quantize models in the [Quantization](../quantization) guide. ## HqqConfig [[autodoc]] HqqConfig + +## FbgemmFp8Config + +[[autodoc]] FbgemmFp8Config + diff --git a/docs/source/en/model_doc/chameleon.md b/docs/source/en/model_doc/chameleon.md index e2a0012ba97f2c..323b83813160b0 100644 --- a/docs/source/en/model_doc/chameleon.md +++ b/docs/source/en/model_doc/chameleon.md @@ -34,13 +34,13 @@ being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and -text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents* +text. Chameleon marks a significant step forward in unified modeling of full multimodal documents* drawing - Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image geenration using an auto-regressive transformer. Taken from the original paper. + Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the original paper. This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). 
The original code can be found [here](https://github.com/facebookresearch/chameleon). @@ -55,27 +55,28 @@ The original code can be found [here](https://github.com/facebookresearch/chamel - Chameleon generates in chat format which means that the generated text will always be the "assistant's turn". You can enable a text completion generation by passing `return_for_text_completion=True` when calling the processor. > [!NOTE] -> Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. For special image token we didn't add a new one but used one of the reserved tokens: ``. +> Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. For special image token we didn't add a new one but used one of the reserved tokens: ``. You have to add `` to your prompt in the place where the image should be embedded for correct generation. ## Usage example ### Single image inference -Here's how to load the model and perform inference in half-precision (`torch.float16`): +Chameleon is a gated model so make sure to have access and login to Hugging Face Hub using a token. +Here's how to load the model and perform inference in half-precision (`torch.bfloat16`): ```python -from transformers import ChameleonProcessor, ChameleonForCausalLM +from transformers import ChameleonProcessor, ChameleonForConditionalGeneration import torch from PIL import Image import requests -processor = ChameleonProcessor.from_pretrained("meta-chameleon") -model = ChameleonForCausalLM.from_pretrained("meta-chameleon", torch_dtype=torch.float16, device_map="auto") +processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b") +model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda") # prepare image and text prompt -url = "https://bjiujitsu.com/wp-content/uploads/2021/01/jiu_jitsu_belt_white_1.jpg" +url = 'http://images.cocodataset.org/val2017/000000039769.jpg' image = Image.open(requests.get(url, stream=True).raw) -prompt = "What color is the belt in this image?" +prompt = "What do you see in this image?" inputs = processor(prompt, image, return_tensors="pt").to(model.device) @@ -89,13 +90,14 @@ print(processor.decode(output[0], skip_special_tokens=True)) Chameleon can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). 
Here is how you can do it: ```python -from transformers import ChameleonProcessor, ChameleonForCausalLM +from transformers import ChameleonProcessor, ChameleonForConditionalGeneration import torch from PIL import Image import requests -processor = ChameleonProcessor.from_pretrained("meta-chameleon") -model = ChameleonForCausalLM.from_pretrained("meta-chameleon", torch_dtype=torch.float16, device_map="auto") +processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b") + +model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda") # Get three different images url = "https://www.ilankelman.org/stopsigns/australia.jpg" @@ -115,7 +117,7 @@ prompts = [ # We can simply feed images in the order they have to be used in the text prompt # Each "" token uses one image leaving the next for the subsequent "" tokens -inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device) +inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(device="cuda", dtype=torch.bfloat16) # Generate generate_ids = model.generate(**inputs, max_new_tokens=50) @@ -129,16 +131,16 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza The model can be loaded in 8 or 4 bits, greatly reducing the memory requirements while maintaining the performance of the original model. First make sure to install bitsandbytes, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with: ```python -from transformers import ChameleonForCausalLM, BitsAndBytesConfig +from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig # specify how to quantize the model quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_compute_dtype=torch.bfloat16, ) -model = ChameleonForCausalLM.from_pretrained("meta-chameleon", quantization_config=quantization_config, device_map="auto") +model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="cuda") ``` ### Use Flash-Attention 2 and SDPA to further speed-up generation @@ -146,11 +148,12 @@ model = ChameleonForCausalLM.from_pretrained("meta-chameleon", quantization_conf The models supports both, Flash-Attention 2 and PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) which can be enables for optimization. SDPA is the default options when you load the model, If you want to switch for Flash Attention 2, first make sure to install flash-attn. Refer to the [original repository](https://github.com/Dao-AILab/flash-attention) regarding that package installation. 
Simply change the snippet above with: ```python -from transformers import ChameleonForCausalLM +from transformers import ChameleonForConditionalGeneration -model = ChameleonForCausalLM.from_pretrained( +model_id = "facebook/chameleon-7b" +model = ChameleonForConditionalGeneration.from_pretrained( model_id, - torch_dtype=torch.float16, + torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, attn_implementation="flash_attention_2" ).to(0) @@ -183,7 +186,7 @@ model = ChameleonForCausalLM.from_pretrained( [[autodoc]] ChameleonModel - forward -## ChameleonForCausalLM +## ChameleonForConditionalGeneration -[[autodoc]] ChameleonForCausalLM +[[autodoc]] ChameleonForConditionalGeneration - forward diff --git a/docs/source/en/model_doc/clip.md b/docs/source/en/model_doc/clip.md index 692ea083717c42..f0829f484aaa51 100644 --- a/docs/source/en/model_doc/clip.md +++ b/docs/source/en/model_doc/clip.md @@ -79,6 +79,123 @@ encode the text and prepare the images. The following example shows how to get t >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities ``` + +### Combining CLIP and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16`) + + + +For small batch sizes, you might notice a slowdown in your model when using flash attention. Refer to the section [Expected speedups with Flash Attention and SDPA](#Expected-speedups-with-Flash-Attention-and-SDPA) below and select an appropriate attention implementation. + + + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> import requests +>>> from PIL import Image + +>>> from transformers import CLIPProcessor, CLIPModel + +>>> device = "cuda" +>>> torch_dtype = torch.float16 + +>>> model = CLIPModel.from_pretrained( +... "openai/clip-vit-base-patch32", +... attn_implementation="flash_attention_2", +... device_map=device, +... torch_dtype=torch_dtype, +... ) +>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) +>>> inputs.to(device) + +>>> with torch.no_grad(): +... with torch.autocast(device): +... outputs = model(**inputs) + +>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score +>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities +>>> print(probs) +tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16) +``` + + +### Using Scaled Dot Product Attention (SDPA) + +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. 
See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) +page for more information. + +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. + +```python +from transformers import CLIPModel + +model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa") +``` + +For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`). + +### Expected speedups with Flash Attention and SDPA + +On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with `float16`, we saw the following speedups during inference for `"openai/clip-vit-large-patch14"` checkpoint ([code](https://gist.github.com/qubvel/ac691a54e54f9fae8144275f866a7ff8)): + +#### CLIPTextModel + +| Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | +|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| +| 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 | +| 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 | +| 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 | +| 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 | +| 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 | + +![clip_text_model_viz_3](https://github.com/user-attachments/assets/e9826b43-4e66-4f4c-952b-af4d90bd38eb) + +#### CLIPVisionModel + +| Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | +|-------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| +| 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 | +| 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 | +| 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 | +| 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 | + +![clip_image_model_viz_3](https://github.com/user-attachments/assets/50a36206-e3b9-4adc-ac8e-926b8b071d63) + +#### CLIPModel + +| Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | +|-------------------:|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| +| 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 | +| 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 | +| 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 | +| 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 | +| 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 | +| 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 | +| 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 | +| 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 | +| 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 | +| 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 | +| 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 | +| 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 | + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP. 
diff --git a/docs/source/en/model_doc/dinov2.md b/docs/source/en/model_doc/dinov2.md index dca94786773d1d..e8f7c08cbfc44b 100644 --- a/docs/source/en/model_doc/dinov2.md +++ b/docs/source/en/model_doc/dinov2.md @@ -57,7 +57,7 @@ print((last_hidden_states - traced_outputs[0]).abs().max()) ## Resources -A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT. +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DINOv2. - Demo notebooks for DINOv2 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2). 🌎 diff --git a/docs/source/en/model_doc/grounding-dino.md b/docs/source/en/model_doc/grounding-dino.md index d258f492abf8b5..a6da554f8d5053 100644 --- a/docs/source/en/model_doc/grounding-dino.md +++ b/docs/source/en/model_doc/grounding-dino.md @@ -41,33 +41,40 @@ The original code can be found [here](https://github.com/IDEA-Research/Grounding Here's how to use the model for zero-shot object detection: ```python -import requests - -import torch -from PIL import Image -from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection, - -model_id = "IDEA-Research/grounding-dino-tiny" - -processor = AutoProcessor.from_pretrained(model_id) -model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device) - -image_url = "http://images.cocodataset.org/val2017/000000039769.jpg" -image = Image.open(requests.get(image_url, stream=True).raw) -# Check for cats and remote controls -text = "a cat. a remote control." - -inputs = processor(images=image, text=text, return_tensors="pt").to(device) -with torch.no_grad(): - outputs = model(**inputs) - -results = processor.post_process_grounded_object_detection( - outputs, - inputs.input_ids, - box_threshold=0.4, - text_threshold=0.3, - target_sizes=[image.size[::-1]] -) +>>> import requests + +>>> import torch +>>> from PIL import Image +>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection + +>>> model_id = "IDEA-Research/grounding-dino-tiny" +>>> device = "cuda" + +>>> processor = AutoProcessor.from_pretrained(model_id) +>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device) + +>>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(image_url, stream=True).raw) +>>> # Check for cats and remote controls +>>> text = "a cat. a remote control." + +>>> inputs = processor(images=image, text=text, return_tensors="pt").to(device) +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> results = processor.post_process_grounded_object_detection( +... outputs, +... inputs.input_ids, +... box_threshold=0.4, +... text_threshold=0.3, +... target_sizes=[image.size[::-1]] +... 
) +>>> print(results) +[{'boxes': tensor([[344.6959, 23.1090, 637.1833, 374.2751], + [ 12.2666, 51.9145, 316.8582, 472.4392], + [ 38.5742, 70.0015, 176.7838, 118.1806]], device='cuda:0'), + 'labels': ['a cat', 'a cat', 'a remote control'], + 'scores': tensor([0.4785, 0.4381, 0.4776], device='cuda:0')}] ``` ## Grounded SAM diff --git a/docs/source/en/model_doc/hiera.md b/docs/source/en/model_doc/hiera.md index 24bf1639fe1400..9bd2816e7a99fc 100644 --- a/docs/source/en/model_doc/hiera.md +++ b/docs/source/en/model_doc/hiera.md @@ -26,8 +26,22 @@ The abstract from the paper is the following: *Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.* + + + Hiera architecture. Taken from the original paper. + This model was a joint contibution by [EduardoPacheco](https://huggingface.co/EduardoPacheco) and [namangarg110](https://huggingface.co/namangarg110). The original code can be found [here] (https://github.com/facebookresearch/hiera). +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Hiera. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + +- [`HieraForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) + ## HieraConfig [[autodoc]] HieraConfig diff --git a/docs/source/en/model_doc/llava-next-video.md b/docs/source/en/model_doc/llava-next-video.md index 88e41efc29c87c..48e50f950621e8 100644 --- a/docs/source/en/model_doc/llava-next-video.md +++ b/docs/source/en/model_doc/llava-next-video.md @@ -43,6 +43,13 @@ The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tre - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. + + +- Llava-Next uses different number of patches for images and thus has to pad the inputs inside modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if model is in `eval()` mode, otherwise "right-padding". + + + + - Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. 
You can use the tokenizer's `apply_chat_template` to format your prompts correctly. Below is an example of how to do that. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. Each content field has to be a list of dicts, as follows:
diff --git a/docs/source/en/model_doc/llava.md b/docs/source/en/model_doc/llava.md
index 43eaa41d5d7140..a7e4b4da7f3c5a 100644
--- a/docs/source/en/model_doc/llava.md
+++ b/docs/source/en/model_doc/llava.md
@@ -40,7 +40,42 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/
 
 - Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results.
-- For better results, we recommend users to prompt the model with the correct prompt format. Below is a list of prompt formats accepted by each llava checkpoint:
+- For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. For that, you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with the keys "role" and "content". The "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:
+
+```python
+from transformers import AutoProcessor
+
+processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What’s shown in this image?"},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Describe the image in more detail."},
+        ],
+    },
+]
+
+text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+
+# Note that the template only formats your prompt; you still have to tokenize it and obtain pixel values for your images
+print(text_prompt)
+>>> "USER: <image>\nWhat’s shown in this image? ASSISTANT: This image shows a red stop sign.</s>USER: Describe the image in more detail. ASSISTANT:"
+```
+
+- If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by each llava checkpoint:
 
 [llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) requires the following format:
 
 ```bash
@@ -64,6 +99,7 @@ For multiple turns conversation:
 "USER: <image>\n<prompt1> ASSISTANT: <answer1>USER: <prompt2> ASSISTANT: <answer2>USER: <prompt3> ASSISTANT:"
 ```
+
 ### Using Flash Attention 2
 
 Flash Attention 2 is an even faster, optimized version of the previous optimization, please refer to the [Flash Attention 2 section of performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one).
diff --git a/docs/source/en/model_doc/llava_next.md b/docs/source/en/model_doc/llava_next.md
index a4a1419ee00ac8..d0558be76467a2 100644
--- a/docs/source/en/model_doc/llava_next.md
+++ b/docs/source/en/model_doc/llava_next.md
@@ -46,26 +46,79 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/
 - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.
-- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. Below, we list the correct prompt formats to use for the text prompt "What is shown in this image?":
+
-[llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) requires the following format:
+- Llava-Next uses a different number of patches for each image and thus has to pad the inputs inside the modeling code, aside from the padding done when processing the inputs. The default setting is "left-padding" if the model is in `eval()` mode, otherwise "right-padding".
+
+
+
+
+- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. For that, you have to construct a conversation history; passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with the keys "role" and "content". The "content" should be a list of dictionaries for the "text" and "image" modalities. Below is an example of how to do that, followed by the list of formats accepted by each checkpoint.
+We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history containing text and an image. Each content field has to be a list of dicts, as follows:
+
+```python
+from transformers import LlavaNextProcessor
+
+processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What’s shown in this image?"},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Describe the image in more detail."},
+        ],
+    },
+]
+
+text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+
+# Note that the template only formats your prompt; you still have to tokenize it and obtain pixel values for your images
+print(text_prompt)
+>>> "[INST] <image>\nWhat’s shown in this image? [/INST] This image shows a red stop sign. [INST] Describe the image in more detail. [/INST]"
+```
+
+- If you want to construct a chat prompt yourself, below is a list of possible formats:
+
+[llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) requires the following format:
 
 ```bash
 "[INST] <image>\nWhat is shown in this image? [/INST]"
 ```
 
 [llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) and [llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) require the following format:
-
 ```bash
 "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is shown in this image? ASSISTANT:"
 ```
 
 [llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) requires the following format:
-
 ```bash
 "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
 ```
+
+[llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llama3-llava-next-8b-hf) requires the following format:
+
+```bash
+"<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+```
+
+[llava-next-72b-hf](https://huggingface.co/llava-hf/llava-next-72b-hf) and [llava-next-110b-hf](https://huggingface.co/llava-hf/llava-next-110b-hf) require the following format:
+
+```bash
+"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"
+```
+
 ## Usage example
 
 ### Single image inference
@@ -86,8 +139,17 @@ model.to("cuda:0")
 
 # prepare image and text prompt, using the appropriate prompt template
 url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
 image = Image.open(requests.get(url, stream=True).raw)
-prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
 inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
 
 # autoregressively complete prompt
@@ -120,15 +182,47 @@ image_cats = Image.open(requests.get(url, stream=True).raw)
 url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
 image_snowman = Image.open(requests.get(url, stream=True).raw)
 
-# Prepare a batched prompt, where the first one is a multi-turn conversation and the second is not
-prompt = [
-    "[INST] <image>\nWhat is shown in this image? [/INST] There is a red stop sign in the image. [INST] <image>\nWhat about this image? How many cats do you see [/INST]",
-    "[INST] <image>\nWhat is shown in this image? [/INST]"
+# Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not
+conversation_1 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": [
+            {"type": "text", "text": "There is a red stop sign in the image."},
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What about this image? How many cats do you see?"},
+        ],
+    },
 ]
+conversation_2 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+
+prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
+prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
+prompts = [prompt_1, prompt_2]
+
 # We can simply feed images in the order they have to be used in the text prompt
 # Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
-inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
+inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
 
 # Generate
 generate_ids = model.generate(**inputs, max_new_tokens=30)
diff --git a/docs/source/en/model_doc/marian.md b/docs/source/en/model_doc/marian.md
index 8078ea1427c952..d8ebec8ffb0ad2 100644
--- a/docs/source/en/model_doc/marian.md
+++ b/docs/source/en/model_doc/marian.md
@@ -105,7 +105,7 @@ from huggingface_hub import list_models
 model_list = list_models()
 org = "Helsinki-NLP"
-model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
+model_ids = [x.id for x in model_list if x.id.startswith(org)]
 suffix = [x.split("/")[1] for x in model_ids]
 old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
 ```
diff --git a/docs/source/en/model_doc/qwen2.md b/docs/source/en/model_doc/qwen2.md
index ac0e25e02c35f9..16815f2fc1f3cd 100644
--- a/docs/source/en/model_doc/qwen2.md
+++ b/docs/source/en/model_doc/qwen2.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 
 ## Overview
 
-Qwen2 is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.
+Qwen2 is the new model series of large language models from the Qwen team, following the earlier Qwen series. It includes Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, Qwen2-72B, Qwen2-Audio, etc.
 
 ### Model Details
 
@@ -27,16 +27,16 @@ Qwen2 is a language model series including decoder language models of different
 
 ## Usage tips
 
-`Qwen2-7B-beta` and `Qwen2-7B-Chat-beta` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
+`Qwen2-7B` and `Qwen2-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
 
-In the following, we demonstrate how to use `Qwen2-7B-Chat-beta` for the inference. Note that we have used the ChatML format for dialog, in this demo we show how to leverage `apply_chat_template` for this purpose.
+In the following, we demonstrate how to use `Qwen2-7B-Instruct` for inference. Note that we use the ChatML format for dialog; in this demo we show how to leverage `apply_chat_template` for this purpose.
 
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
 >>> device = "cuda" # the device to load the model onto
 
->>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B-Chat", device_map="auto")
->>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
+>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
 
 >>> prompt = "Give me a short introduction to large language model."
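Note: the Qwen2 snippet above stops at the prompt definition. For anyone who wants to sanity-check the renamed `Qwen/Qwen2-7B-Instruct` checkpoint end to end, the sketch below shows how a ChatML-style generation with `apply_chat_template` typically continues. It is an illustrative aside rather than part of the patch; the system-prompt wording and the `max_new_tokens` value are assumptions, not values taken from the documentation.

```python
# Illustrative continuation of the Qwen2-7B-Instruct example above (not part of the patch).
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

prompt = "Give me a short introduction to large language models."

# Build a ChatML-style conversation; the system message here is an assumed placeholder.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# apply_chat_template only formats the text; tokenization happens in the separate tokenizer call below.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# max_new_tokens=512 is an illustrative choice, not a documented default.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

Because `add_generation_prompt=True` appends the assistant header to the formatted text, slicing off the prompt tokens before decoding leaves only the model's reply.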
diff --git a/docs/source/en/model_doc/roberta.md b/docs/source/en/model_doc/roberta.md
index 364b5b37e5f3f0..2a1843d8885abe 100644
--- a/docs/source/en/model_doc/roberta.md
+++ b/docs/source/en/model_doc/roberta.md
@@ -51,19 +51,19 @@ This model was contributed by [julien-c](https://huggingface.co/julien-c). The o
 
 ## Usage tips
 
-- This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup
-  for Roberta pretrained models.
-- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
+- This implementation is the same as [`BertModel`] with a minor tweak to the embeddings, as well as a setup
+  for RoBERTa pretrained models.
+- RoBERTa has the same architecture as BERT but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
   different pretraining scheme.
-- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just
-  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
-- Same as BERT with better pretraining tricks:
-
-    * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
-    * together to reach 512 tokens (so the sentences are in an order than may span several documents)
-    * train with larger batches
-    * use BPE with bytes as a subunit and not characters (because of unicode characters)
-- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples.
+- RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
+  separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).
+- RoBERTa is similar to BERT but with better pretraining techniques:
+
+    * Dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all.
+    * Sentence packing: Sentences are packed together to reach 512 tokens (so the sentences are in an order that may span several documents).
+    * Larger batches: Training uses larger batches.
+    * Byte-level BPE vocabulary: Uses BPE with bytes as a subunit instead of characters, accommodating Unicode characters.
+- [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to its model page for usage examples.
 
 ## Resources
diff --git a/docs/source/en/model_doc/video_llava.md b/docs/source/en/model_doc/video_llava.md
index 307c55bb2cef63..f098e82a177670 100644
--- a/docs/source/en/model_doc/video_llava.md
+++ b/docs/source/en/model_doc/video_llava.md
@@ -98,7 +98,7 @@ indices = np.arange(0, total_frames, total_frames / 8).astype(int)
 video = read_video_pyav(container, indices)
 
 # For better results, we recommend to prompt the model in the following format
-prompt = "USER: